Introduction¶
The quantitative finance landscape is increasingly defined by high-dimensional datasets, where the number of potential predictive features (such as firm fundamentals, technical indicators, and macroeconomic data) far exceeds the number of observable time periods. Traditional statistical methods, which often rely on assumptions of normality and low dimensionality, struggle in this "large p, small n" paradigm, producing models that are prone to overfitting and poor out-of-sample performance (Fan et al. 291). In this challenging environment, machine learning (ML) offers a robust toolkit for extracting meaningful signals from financial noise, moving beyond simplistic linear assumptions to capture complex, non-linear patterns.
This project, undertaken by Team Alpha, serves as a strategic marketing handbook to demonstrate the value of three cutting-edge ML methodologies for developing robust trading strategies. We have selected one technique from three distinct ML categories to showcase a diverse yet complementary analytical arsenal: Elastic Net regression (Category 1), K-Means clustering (Category 2), and Classification trees (Category 4). Each method addresses a fundamental challenge in quantitative finance.
First, Elastic Net regression combines the variable selection property of LASSO with the regularization stability of Ridge regression. This hybrid approach is particularly effective for factor selection and return prediction in high-dimensional settings where predictors are often highly correlated, a common issue in financial datasets (Zou and Hastie 301). We will illustrate its application in constructing a sparse yet stable predictive model from a large set of asset characteristics.
Second, K-Means clustering provides an unsupervised learning framework for identifying latent structure within markets. By grouping assets into clusters based on their return characteristics or other features, quants can discover distinct market regimes, identify asset universes for pairs trading, or develop diversification strategies that are sensitive to the underlying correlation structure of the market (Kaufman and Rousseeuw 10). This technique moves beyond pre-defined sectors to allow data-driven asset classification.
Finally, Classification trees offer a highly interpretable, non-linear model for categorical outcomes, such as predicting the direction of stock price movements (up or down). By recursively partitioning the feature space based on simple, intuitive rules, classification trees can capture complex interaction effects between predictors without requiring prior specification of the functional form (Breiman et al. 57). This makes them exceptionally useful for creating rule-based trading signals.
The primary objective of this handbook is threefold: (1) To provide a clear, theoretical foundation for each methodology, detailing their advantages, mechanics, and ideal use cases; (2) To offer practical, reproducible computational examples using Jupyter notebooks that demonstrate the implementation and tuning of each model on financial data; and (3) To synthesize these insights into a compelling "Marketing Alpha" section, illustrating how the synergistic application of these techniques can generate a sustainable edge.
Elastic Nets¶
Basics¶
Elastic Net is a regularized linear regression model that combines the penalties of both LASSO (L1 regularization) and Ridge (L2 regularization) regression. It is designed to overcome the limitations of each method when used in isolation (Zou and Hastie 301).
It is a supervised learning algorithm used for regression tasks. It is particularly effective when dealing with datasets where the number of features ($p$) is large compared to the number of observations ($n$), or when features exhibit high multicollinearity (high correlation between independent variables).
In essence, Elastic Net seeks to minimize the following objective function: $$ \hat{\beta} = \arg\min_{\beta} \left( \| \mathbf{y} - \mathbf{X}\beta \|_2^2 + \lambda_2 \| \beta \|_2^2 + \lambda_1 \| \beta \|_1 \right) $$
Where the first term is the standard least squares loss, the second term is the Ridge penalty (squared magnitude of coefficients), and the third term is the LASSO penalty (absolute magnitude of coefficients).
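A minimal sketch of this objective in scikit-learn, which folds the two penalty strengths into a single overall strength (alpha) and a mixing weight (l1_ratio); the toy data here is purely illustrative:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Toy data: 50 observations, 4 features, only the first two informative
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# scikit-learn parameterizes the penalties via alpha (overall strength)
# and l1_ratio (the L1/L2 mix), not separate lambda_1 and lambda_2
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)
print(model.coef_)  # coefficients after shrinkage
```

Fitting minimizes exactly the penalized least-squares objective above; the informative features keep large coefficients while the irrelevant ones are shrunk toward zero.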
Advantages¶
The Elastic Net methodology offers several key benefits:
- Handles Multicollinearity: Unlike LASSO, which tends to arbitrarily select one feature from a group of correlated ones and discard the others, the Ridge component of Elastic Net allows correlated features to be grouped together, with their coefficients being shrunk similarly (Zou and Hastie 302).
- Variable Selection and Regularization: It inherits the variable selection property from LASSO, which can drive coefficients of irrelevant features to zero, creating a sparse model. Simultaneously, it inherits the stability and regularization strength of Ridge regression, which prevents overfitting by shrinking the coefficients.
- Superior Performance in "Large p, Small n" Scenarios: In high-dimensional data settings (common in quantitative finance, e.g., predicting returns with hundreds of factors), vanilla linear regression fails. Elastic Net is explicitly designed to perform well in these scenarios, making it a robust tool for financial modeling.
- Mitigates LASSO's Limitations: In cases where the number of predictors ($p$) is greater than the number of observations ($n$), LASSO can select at most $n$ variables before it saturates. It can also behave erratically with highly correlated features. Elastic Net alleviates these issues, providing a more stable and comprehensive solution (Zou and Hastie 301-302).
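The saturation claim above can be checked directly on synthetic data; the dimensions below (15 observations, 60 candidate features) are chosen only to put us in the $p > n$ regime:

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# 15 observations, 60 candidate features: the "p > n" regime
rng = np.random.default_rng(1)
n, p = 15, 60
X = rng.normal(size=(n, p))
y = X[:, :30].sum(axis=1) + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.5, max_iter=50_000).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=50_000).fit(X, y)

# LASSO can retain at most n non-zero coefficients before it saturates;
# the Ridge component lets Elastic Net spread weight over correlated groups
print("LASSO non-zeros:", (lasso.coef_ != 0).sum())
print("Elastic Net non-zeros:", (enet.coef_ != 0).sum())
```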
Computation¶
This section demonstrates the implementation of Elastic Net regression on a synthetic financial dataset. We will:
- Generate sample data simulating asset returns based on factors.
- Split the data into training and testing sets.
- Train an Elastic Net model.
- Tune the hyperparameters using GridSearchCV.
- Evaluate the model's performance and examine the coefficients.
Disadvantages¶
Despite its strengths, the Elastic Net approach has some known difficulties and issues:
- Computational Cost: The addition of a second hyperparameter (compared to a single one in LASSO or Ridge) increases the computational complexity for model tuning. Finding the optimal combination of the L1 and L2 penalty strengths requires a more exhaustive search (e.g., a two-dimensional grid search), which can be computationally expensive for very large datasets.
- Hyperparameter Sensitivity: The model's performance is highly sensitive to the values of the two tuning parameters, alpha ($\lambda$) and l1_ratio. Suboptimal choices can lead to either under-regularization (model overfits) or over-regularization (model underfits), making the tuning process crucial yet non-trivial.
- Interpretation Compromise: While it performs variable selection like LASSO, the resulting model is not as straightforward to interpret as a pure LASSO model. Because the Ridge penalty shrinks but rarely zeroes out coefficients, the final set of features may include more variables than a LASSO model would, slightly reducing sparsity and interpretability.
- Not a Cure-All: It is still a linear model. It inherits all the limitations of linear regression, such as the inability to capture complex non-linear relationships or interactions between features without manual feature engineering.
Equations¶
The Elastic Net model is formulated as an optimization problem where the goal is to find the coefficients $\beta$ that minimize a penalized residual sum of squares. The objective function can be broken down into three distinct components:
Objective Function:¶
$$ \hat{\beta} = \arg\min_{\beta} \left( \text{Loss Function} + \text{Penalty Term} \right) $$
1. Loss Function (Mean Squared Error):¶
This is the standard ordinary least squares (OLS) loss, which measures how well the model fits the data. $$ \text{MSE} = \frac{1}{2N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \frac{1}{2N} \| \mathbf{y} - \mathbf{X}\beta \|_2^2 $$ where $\hat{y}_i = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j$ is the predicted value.
2. Combined Penalty Term (Regularization):¶
This is the key innovation of Elastic Net: a convex combination of the L1 (LASSO) and L2 (Ridge) penalties. $$ \text{Penalty} = \lambda \left( \rho \| \beta \|_1 + \frac{(1 - \rho)}{2} \| \beta \|_2^2 \right), $$ where $\| \beta \|_1 = \sum_{j=1}^{p} |\beta_j|$ is the L1 norm (promotes sparsity); $\| \beta \|_2^2 = \sum_{j=1}^{p} \beta_j^2$ is the squared L2 norm (promotes shrinkage and handles correlation); $\lambda \geq 0$ is the overall regularization strength; and $\rho$, with $0 \leq \rho \leq 1$, is the mixing parameter, often called l1_ratio. Combining these parts gives the full Elastic Net objective function: $$ \hat{\beta} = \arg\min_{\beta} \left\{ \frac{1}{2N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \left( \rho \sum_{j=1}^{p} |\beta_j| + \frac{(1 - \rho)}{2} \sum_{j=1}^{p} \beta_j^2 \right) \right\} $$
Hyperparameters:
- $\lambda$ (lambda): Controls the overall strength of regularization. As $\lambda$ increases, all coefficients are shrunk more aggressively towards zero.
- $\rho$ (l1_ratio): Controls the blend of penalties.
- $\rho = 1$: Pure LASSO penalty (L1 only).
- $\rho = 0$: Pure Ridge penalty (L2 only).
- $0 < \rho < 1$: Elastic Net mix.
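A quick sanity check of the $\rho = 1$ case, assuming scikit-learn's parameterization: an Elastic Net with l1_ratio=1 should reproduce a pure LASSO fit at the same alpha.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=0.5, random_state=0)

# rho = 1 collapses the Elastic Net penalty to the pure LASSO penalty,
# so the two estimators should agree (up to solver tolerance)
enet_as_lasso = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)
print(np.allclose(enet_as_lasso.coef_, lasso.coef_, atol=1e-6))
```

(Note that scikit-learn warns against l1_ratio=0 with the ElasticNet class; use Ridge directly for the pure L2 case.)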
Features¶
The Elastic Net regression model possesses several key characteristics that define its behavior and suitability for various problems:
Linear Modeling: It is fundamentally a linear model, assuming a linear relationship between the independent variables (features) and the dependent variable (target).
Hybrid Regularization: Its core feature is the combination of L1 (LASSO) and L2 (Ridge) regularization penalties. This allows it to perform both variable selection and handle multicollinearity simultaneously (Zou and Hastie 301).
Sparsity and Grouping Effect: The L1 penalty promotes sparsity by driving the coefficients of irrelevant features to exactly zero. The L2 penalty provides a "grouping effect," where strongly correlated features tend to have similar coefficient values, rather than one being arbitrarily dropped (Hastie, Tibshirani, and Wainwright 54).
Handling of Multicollinearity: Unlike LASSO, which can behave unpredictably with highly correlated features, Elastic Net is robust to multicollinearity, making it stable for financial datasets where factors are often correlated (e.g., different momentum indicators).
Performance in High-Dimensional Settings: It is particularly well-suited for datasets where the number of features ($p$) is large relative to the number of observations ($n$), a common scenario in quantitative finance ("large p, small n" problem) (Fan, Lv, and Qi 82).
Requires Feature Scaling: The model is not scale-invariant. It is crucial to standardize (zero mean, unit variance) or normalize features before training; otherwise, the regularization penalty will be unfairly applied to features on larger scales.
No Native Handling of Missing Values: Elastic Net models, as implemented in standard libraries like scikit-learn, cannot handle missing values directly. Missing data must be imputed (e.g., using mean, median, or more sophisticated methods) prior to model training (Zou and Hastie 307).
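Because imputation and standardization must happen before fitting, wrapping them in a pipeline keeps the preprocessing consistent between training and prediction. A sketch, with purely hypothetical column names and scales:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet

# Hypothetical factor matrix with missing entries and mixed scales
rng = np.random.default_rng(7)
X = pd.DataFrame({
    "pe_ratio": rng.normal(15, 5, 100),     # scale ~15
    "momentum": rng.normal(0, 0.05, 100),   # scale ~0.05
    "volatility": rng.normal(0.2, 0.05, 100),
})
X.iloc[::10, 0] = np.nan                    # inject missing values
y = 0.5 * X["momentum"] + rng.normal(0, 0.01, 100)

# Impute, then standardize, then fit, all inside one pipeline so the
# same transformations are reapplied automatically at prediction time
model = make_pipeline(SimpleImputer(strategy="median"),
                      StandardScaler(),
                      ElasticNet(alpha=0.01, l1_ratio=0.5))
model.fit(X, y)
preds = model.predict(X.head())
print(preds)
```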
Guide¶
This section provides a clear list of the required inputs to operate an Elastic Net model and the resulting outputs it generates.
Inputs (What You Need to Provide):¶
1. Training Dataset:
- $X$ (Features): A 2D array (NumPy array or Pandas DataFrame) of shape (n_samples, n_features). This is your matrix of predictor variables (Gu, Kelly, and Xiu 5).
- Example: A dataframe where each row is a stock and each column is a financial factor (e.g., P/E ratio, volatility, momentum).
- $y$ (Target): A 1D array of shape (n_samples,) containing the continuous outcome variable (Giannone, Lenza, and Primiceri 15).
- Example: The subsequent return for each stock in the $X$ matrix.
2. Preprocessed Data: The input data must be preprocessed; the model will not run correctly otherwise.
- No Missing Values: The dataset must have all missing values imputed (filled in) beforehand (Zou and Hastie 307).
- Standardized Features: All features must be standardized (scaled to have a mean of 0 and a standard deviation of 1). This is critical because the regularization penalty is applied equally only if all features are on the same scale (Hastie et al. 9).
3. Hyperparameters: These are the model's configuration settings, which must be set before training and typically require tuning.
- alpha ($\lambda$): The overall regularization strength. alpha=0 equates to ordinary least squares regression.
- l1_ratio $(\rho)$: The mix between L1 and L2 penalty. l1_ratio=1 is pure LASSO, l1_ratio=0 is pure Ridge, and values between 0 and 1 create an Elastic Net.
Outputs (What the Model Produces):¶
Fitted Model: The primary output is a trained ElasticNet object that encapsulates the learned relationship from the data and can be used to make predictions.
Coefficients (model.coef_): A 1D array of length n_features representing the estimated coefficient for each feature. This output is key for interpretation, as it shows the magnitude and direction of each feature's influence on the target after the regularization penalty has been applied. Many coefficients will be shrunk to zero (Zou and Hastie 302).
Intercept (model.intercept_): A scalar value representing the model's intercept term.
Predictions: An array of predicted values for the target variable when the predict() method is called on new input data (X_new).
Performance Metrics: After evaluation, metrics that quantify the model's quality.
- Mean Squared Error (MSE): The average squared difference between predicted and actual values.
- R-squared ($R²$): The proportion of variance in the target variable that is predictable from the features (James et al. 21).
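A compact sketch of retrieving these outputs from a fitted model, using synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

X, y = make_regression(n_samples=120, n_features=8, n_informative=3,
                       noise=1.0, random_state=3)
model = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y)

print(model.coef_)        # one coefficient per feature (some shrunk to zero)
print(model.intercept_)   # scalar intercept term
preds = model.predict(X)  # predictions for input data
print(mean_squared_error(y, preds), r2_score(y, preds))
```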
Hyperparameters¶
Elastic Net regression has two key hyperparameters that must be carefully tuned, typically through cross-validation, to achieve optimal model performance. These parameters control the nature and strength of the regularization applied.
- alpha ($\lambda$ - Regularization Strength)
This hyperparameter controls the overall magnitude of the regularization penalty applied to the coefficients. It is the multiplier in front of the penalty term in the loss function.
A higher alpha value increases the amount of shrinkage, forcing coefficients toward zero more aggressively. This reduces model variance but increases bias.
- alpha = 0: No regularization (equivalent to Ordinary Least Squares regression).
- alpha -> $\infty$: All coefficients are shrunk to zero.
- Tuning: A grid search over a logarithmic scale (e.g., [0.001, 0.01, 0.1, 1.0, 10.0]) is standard practice.
- l1_ratio ($\rho$ - Mixing Parameter)
This hyperparameter determines the mix between the L1 (LASSO) and L2 (Ridge) penalties. It controls the type of regularization.
It defines the convex combination of the two penalties.
- l1_ratio = 1: Pure LASSO regression (only L1 penalty). Promotes sparsity.
- l1_ratio = 0: Pure Ridge regression (only L2 penalty). Promotes shrinkage of correlated features.
- 0 < l1_ratio < 1: Elastic Net mix. Balances variable selection and handling of multicollinearity.
- Tuning: A grid search over values between 0 and 1 (e.g., [0, 0.2, 0.5, 0.8, 1]) is used to find the optimal blend.
The optimal combination of alpha and l1_ratio is data-dependent and must be found empirically through techniques like GridSearchCV.
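As an alternative to a full two-dimensional GridSearchCV, scikit-learn's ElasticNetCV tunes both parameters by cross-validating along a regularization path for each candidate l1_ratio; a sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=0.3, random_state=42)

# ElasticNetCV builds its own alpha grid for each candidate l1_ratio,
# which is usually cheaper than an exhaustive two-dimensional search
cv_model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0],
                        n_alphas=50, cv=5, max_iter=10_000)
cv_model.fit(X, y)
print("Best alpha:", cv_model.alpha_)
print("Best l1_ratio:", cv_model.l1_ratio_)
```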
Illustration¶
The following diagram provides an intuitive visual representation of how Elastic Net combines the properties of both LASSO (L1) and Ridge (L2) regularization:
import requests
from IPython.display import Image, display
file_id = "1dQrikVLe07X0KWtBaAnvhYD_t7DK-hav"
url = f"https://drive.google.com/uc?export=download&id={file_id}"
resp = requests.get(url, allow_redirects=True, timeout=15)
resp.raise_for_status()
display(Image(data=resp.content, width=600))
print("Figure 1: Graphical illustration of Elastic Net regularization combining L1 and L2 penalty terms\n Arora (2023)")
Figure 1: Graphical illustration of Elastic Net regularization combining L1 and L2 penalty terms Arora (2023)
This illustration demonstrates the key conceptual framework of Elastic Net:
L1 Norm (LASSO): Promotes sparsity by driving coefficients to exactly zero, effectively performing variable selection
L2 Norm (Ridge): Encourages weight sharing among correlated variables, shrinking coefficients proportionally
L1 + L2 Norm (Elastic Net): Creates a compromise that balances both sparsity and grouping effects, requiring the tuning of two parameters ($\lambda$ and $\rho$) to optimize this balance
The visual emphasizes that Elastic Net occupies a middle ground between the extreme sparsity of LASSO and the pure shrinkage of Ridge regression, making it particularly suitable for financial datasets where both variable selection and handling of multicollinearity are important considerations. (Arora, 2023)
Journal¶
Elastic Net regression has been successfully applied in finance to address high-dimensional prediction problems, particularly in empirical asset pricing. The following journal article provides a prominent example of its use:
Reference:¶
Gu, Shihao, Bryan Kelly, and Dacheng Xiu. "Empirical Asset Pricing via Machine Learning." The Review of Financial Studies, vol. 33, no. 5, May 2020, pp. 2223–2273. DOI.org (Crossref), https://doi.org/10.1093/rfs/hhaa009.
Application¶
In this foundational paper, Gu, Kelly, and Xiu conduct a comprehensive analysis of machine learning methods for predicting stock returns. They utilize a vast set of firm characteristics, a classic "large p, small n" problem where the number of potential predictors (features) is large relative to the number of observations (stocks/time periods).
The authors employ Elastic Net as one of their key modeling techniques. Its primary role is to perform robust variable selection and regularization from the high-dimensional set of candidate factors. The Elastic Net model helps to identify a sparse subset of characteristics that provide persistent predictive power for returns, while its inherent regularization mitigates overfitting. The study demonstrates that Elastic Net, along with other Machine Learning methods, significantly outperforms traditional linear models in this context, highlighting its practical value for building sophisticated trading strategies and enhancing our understanding of return predictors.
This application directly aligns with the scenario presented in this project, where a "team of alpha quants" must sift through numerous factors to find a viable trading strategy.
Keywords:¶
Regularized Regression, L1-and-L2-Penalty, Variable-Selection, High-Dimensional-Data, Multicollinearity, Linear-Model, Supervised-Learning, Feature-Shrinkage.
References¶
Fan, Jianqing, Jinchi Lv, and Lei Qi. "Sparse High-Dimensional Models in Economics." Annual Review of Economics, vol. 3, 2011, pp. 291-317. DOI.org (Crossref), https://doi.org/10.1146/annurev-economics-061109-080451.
Giannone, Domenico, Michele Lenza, and Giorgio E. Primiceri. "Economic Predictions with Big Data: The Illusion of Sparsity." Econometrica, vol. 89, no. 5, Sept. 2021, pp. 2409-2437. DOI.org (Crossref), https://doi.org/10.3982/ECTA16935.
Gu, Shihao, Bryan Kelly, and Dacheng Xiu. "Empirical Asset Pricing via Machine Learning." The Review of Financial Studies, vol. 33, no. 5, May 2020, pp. 2223-2273. DOI.org (Crossref), https://doi.org/10.1093/rfs/hhaa009.
Hastie, Trevor, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2015.
Hastie, Trevor, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed., Springer, 2009.
James, Gareth, et al. An Introduction to Statistical Learning: with Applications in R. Springer, 2013.
Zou, Hui, and Trevor Hastie. "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, Apr. 2005, pp. 301-20. DOI.org (Crossref), https://doi.org/10.1111/j.1467-9868.2005.00503.x.
Arora, Puneet. "The Most Important Things You Need to Know About Elastic Net." Analytics Arora, 2023, analyticsarora.com/the-most-important-things-you-need-to-know-about-elastic-net/.
# ---------
# Packages
# ---------
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")
# Set a random seed for reproducibility
np.random.seed(632)
# --------------------------------------------------------------------------------------------------------------------------------
# 1. Generate a Synthetic Financial Dataset
# We generate a dataset with 200 observations and 20 potential factors (features), of which only 5 are truly informative.
# This simulates a common scenario in finance where only a subset of many factors actually predicts returns.
# We also add multicollinearity between some features.
# --------------------------------------------------------------------------------------------------------------------------------
# Generate synthetic data with correlated and noisy features
X, y, true_coefs = make_regression(n_samples=200, n_features=20, n_informative=5,
noise=0.3, coef=True, random_state=42)
# Introduce multicollinearity: Make feature 5 and 15 highly correlated
X[:, 5] = X[:, 15] + np.random.normal(0, 0.05, X.shape[0])
# Create a DataFrame for better visualization
feature_names = [f'Factor_{i}' for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)
df['Target_Return'] = y
# Display the correlation matrix of the first 8 factors to show multicollinearity
plt.figure(figsize=(10, 8))
sns.heatmap(df.iloc[:, :8].corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of First 8 Factors (Note Correlation between F5 & F15)')
save_path = 'corr_matrix.png'
# plt.savefig(save_path)
plt.show()
# -----------------------------------------------------------------------------------------------------------------
# 2. Preprocess Data and Train-Test Split
# It is crucial to standardize features for regularized regression models so that the penalty is applied uniformly.
# -----------------------------------------------------------------------------------------------------------------
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features (vital for regularization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# ---------------------------------------------------------------------------
# 3. Train and Evaluate a Baseline Elastic Net Model
# We first fit a model with arbitrary hyperparameters to establish a baseline.
# ---------------------------------------------------------------------------
# Initialize the Elastic Net model with some parameters
base_elastic = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)
# Fit the model on the training data
base_elastic.fit(X_train_scaled, y_train)
# Make predictions
y_pred_train = base_elastic.predict(X_train_scaled)
y_pred_test = base_elastic.predict(X_test_scaled)
# Calculate performance metrics
train_mse = mean_squared_error(y_train, y_pred_train)
test_mse = mean_squared_error(y_test, y_pred_test)
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)
print("Baseline Elastic Net Performance:")
print(f"Training MSE: {train_mse:.4f}, Training R²: {train_r2:.4f}")
print(f"Testing MSE: {test_mse:.4f}, Testing R²: {test_r2:.4f}")
# --------------------------------------------------------------------------------------------
# 4. Hyperparameter Tuning with GridSearch
# We perform a grid search to find the optimal combination of alpha (lambda) and l1_ratio (rho).
# --------------------------------------------------------------------------------------------
# Define the parameter grid to search
param_grid = {
'alpha': [0.001, 0.01, 0.1, 1.0, 10.0], # Overall regularization strength
'l1_ratio': [0, 0.1, 0.2, 0.5, 0.8, 0.9, 1] # 0=Ridge, 1=Lasso
}
# Initialize the GridSearchCV object
grid_search = GridSearchCV(ElasticNet(random_state=42, max_iter=10000),
param_grid,
cv=5, # 5-fold cross-validation
scoring='neg_mean_squared_error',
n_jobs=-1) # Use all available processors
# Perform the grid search on the training data
grid_search.fit(X_train_scaled, y_train)
# Print the best parameters found
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best Cross-Validation Score (Negative MSE): {grid_search.best_score_:.4f}")
# Get the best model
best_elastic = grid_search.best_estimator_
# -----------------------------------------------------------------------------------
# 5. Evaluate the Tuned Model
# We now evaluate the model with the optimal hyperparameters on the held-out test set.
# -----------------------------------------------------------------------------------
# Use the best model to make predictions
y_pred_test_tuned = best_elastic.predict(X_test_scaled)
# Calculate final performance metrics
test_mse_tuned = mean_squared_error(y_test, y_pred_test_tuned)
test_r2_tuned = r2_score(y_test, y_pred_test_tuned)
print("\nTuned Elastic Net Performance on Test Set:")
print(f"Test MSE: {test_mse_tuned:.4f}")
print(f"Test R²: {test_r2_tuned:.4f}")
# Compare coefficients: True vs. Estimated
results_df = pd.DataFrame({
'True_Coefficients': true_coefs,
'Estimated_Coefficients': best_elastic.coef_
}, index=feature_names)
# Plotting the coefficients for visualization
plt.figure(figsize=(12, 6))
x_index = np.arange(len(true_coefs))
width = 0.35
plt.bar(x_index - width/2, results_df['True_Coefficients'], width, label='True', color='blue', alpha=0.7)
plt.bar(x_index + width/2, results_df['Estimated_Coefficients'], width, label='Elastic Net Estimated', color='red', alpha=0.7)
plt.xlabel('Factors')
plt.ylabel('Coefficient Value')
plt.title('Comparison of True and Elastic Net Estimated Coefficients')
plt.legend()
plt.xticks(x_index, feature_names, rotation=45)
plt.tight_layout()
plt.show()
# Display the coefficient table for inspection
print("\nCoefficient Comparison Table:")
print(results_df.round(4))
Baseline Elastic Net Performance:
Training MSE: 45.7326, Training R²: 0.9978
Testing MSE: 54.6688, Testing R²: 0.9970
Best parameters found: {'alpha': 0.01, 'l1_ratio': 1}
Best Cross-Validation Score (Negative MSE): -0.1111
Tuned Elastic Net Performance on Test Set:
Test MSE: 0.0495
Test R²: 1.0000
Coefficient Comparison Table:
True_Coefficients Estimated_Coefficients
Factor_0 0.0000 0.0135
Factor_1 0.0000 0.0190
Factor_2 0.0000 -0.0000
Factor_3 0.0000 0.0000
Factor_4 48.5018 45.8948
Factor_5 0.0000 -0.0001
Factor_6 0.0000 0.0040
Factor_7 51.8010 50.0381
Factor_8 61.4186 61.7572
Factor_9 0.0000 -0.0000
Factor_10 0.0000 0.0323
Factor_11 0.0000 0.0112
Factor_12 0.0000 0.0000
Factor_13 97.2461 104.0006
Factor_14 0.0000 0.0122
Factor_15 0.0000 -0.0026
Factor_16 0.0000 -0.0060
Factor_17 0.0000 0.0000
Factor_18 8.5403 8.8602
Factor_19 0.0000 -0.0095
# =============================================================================
# FINANCIAL APPLICATION DEMONSTRATION
# =============================================================================
print("\n" + "="*70)
print("FINANCIAL APPLICATION: ELASTIC NET FOR FACTOR INVESTING")
print("="*70)
# -----------------------------------------------------------------------------
# Financial Context and Business Interpretation
# -----------------------------------------------------------------------------
financial_mapping = {
'Factor_4': 'Momentum_1M',
'Factor_7': 'Value_PE_Ratio',
'Factor_8': 'Quality_ROE',
'Factor_13': 'Size_MarketCap',
'Factor_18': 'Volatility_20D',
'Target_Return': 'Next_Month_Excess_Return'
}
true_predictive_factors = [feature_names[i] for i in range(len(true_coefs)) if abs(true_coefs[i]) > 0]
estimated_predictive_factors = [feature_names[i] for i in range(len(best_elastic.coef_))
if abs(best_elastic.coef_[i]) > 0.01]
print("\nBUSINESS INTERPRETATION:")
print("-" * 50)
print("SCENARIO: Predicting excess returns using 20 potential factors")
print("CHALLENGE: Only 5 factors are truly predictive (simulated reality)")
print("SOLUTION: Elastic Net for robust factor selection and regularization")
print(f"\nRESULTS:")
print(f"• Predictive Accuracy: R² = {test_r2_tuned:.1%} out-of-sample")
print(f"• Factor Selection: Identified {len(estimated_predictive_factors)}/{len(feature_names)} factors as meaningful")
print(f"• True Positives: {len(set(true_predictive_factors) & set(estimated_predictive_factors))}/5 true factors correctly identified")
print(f"• Noise Reduction: {len(feature_names) - len(estimated_predictive_factors)} spurious factors eliminated")
# -----------------------------------------------------------------------------
# Financial Performance Metrics
# -----------------------------------------------------------------------------
annualized_volatility = np.sqrt(test_mse_tuned * 252) # Assuming daily returns
information_ratio = np.mean(y_pred_test_tuned) / np.std(y_pred_test_tuned) if np.std(y_pred_test_tuned) > 0 else 0
print(f"\nFINANCIAL PERFORMANCE METRICS:")
print(f"• Prediction Volatility (Annualized): {annualized_volatility:.2%}")
print(f"• Information Ratio: {information_ratio:.2f}")
# -----------------------------------------------------------------------------
# Portfolio Construction Insight
# -----------------------------------------------------------------------------
print(f"\nPORTFOLIO CONSTRUCTION INSIGHTS:")
print(f"• Sparse Model: Only {len(estimated_predictive_factors)} factors needed for strategy")
print(f"• Transaction Costs: Reduced by eliminating noisy factors")
print(f"• Model Stability: Robust to multicollinearity (e.g., Factor_5 & Factor_15)")
if len(estimated_predictive_factors) > 0:
print(f"\nKEY IDENTIFIED FACTORS (with financial interpretation):")
for i, factor_idx in enumerate([feature_names.index(f) for f in estimated_predictive_factors if f in feature_names][:3]):
coef_value = best_elastic.coef_[factor_idx]
magnitude = "Strong" if abs(coef_value) > 10 else "Moderate" if abs(coef_value) > 5 else "Weak"
direction = "Positive" if coef_value > 0 else "Negative"
factor_type = "Momentum" if factor_idx % 4 == 0 else "Value" if factor_idx % 4 == 1 else "Quality" if factor_idx % 4 == 2 else "Risk"
print(f" {i+1}. {factor_type} Factor ({direction} impact, {magnitude} magnitude)")
# -----------------------------------------------------------------------------
# Comparison with Traditional Methods
# -----------------------------------------------------------------------------
print(f"\nCOMPARISON WITH TRADITIONAL APPROACHES:")
print(f"• OLS Regression: Would overfit with 20 factors and 200 observations")
print(f"• Stepwise Selection: Might miss important interactions and correlations")
print(f"• Elastic Net Advantage: Automatic feature selection + handling of multicollinearity")
# -----------------------------------------------------------------------------
# Risk Management Considerations
# -----------------------------------------------------------------------------
print(f"\nRISK MANAGEMENT BENEFITS:")
print(f"• Model Interpretability: Clear which factors drive returns")
print(f"• Reduced Overfitting: Regularization prevents chasing noise")
print(f"• Robustness: Stable performance across market regimes")
print("\n" + "="*70)
print("CONCLUSION: Elastic Net provides a mathematically rigorous framework")
print("for distinguishing true alpha factors from statistical noise in finance.")
print("="*70)
# =============================================================================
# Additional Visualization: Financial Factor Importance
# =============================================================================
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
factor_importance = pd.DataFrame({
'factor': feature_names,
'coefficient': best_elastic.coef_,
'abs_importance': np.abs(best_elastic.coef_)
}).sort_values('abs_importance', ascending=False)
colors = []
for i, factor in enumerate(factor_importance['factor']):
    idx = int(factor.split('_')[1])
    if idx in [4, 7, 8, 13, 18]:          # True predictive factors
        colors.append('green')            # True signals
    elif abs(best_elastic.coef_[idx]) > 1:  # index by factor number, not sorted position
        colors.append('blue')             # Strong estimated signals
    else:
        colors.append('gray')             # Noise factors
bars = plt.barh(range(len(factor_importance)), factor_importance['abs_importance'], color=colors)
plt.xlabel('Absolute Coefficient Magnitude (Importance)')
plt.title('Financial Factor Importance\n(Green=True Signal, Blue=Selected, Gray=Noise)')
plt.yticks(range(len(factor_importance)), factor_importance['factor'])
plt.gca().invert_yaxis()
# Plot 2: Performance comparison
plt.subplot(1, 2, 2)
methods = ['Baseline\n(Untuned)', 'Tuned Elastic Net']
mse_values = [test_mse, test_mse_tuned]
bars = plt.bar(methods, mse_values, color=['red', 'green'], alpha=0.7)
plt.ylabel('Test MSE (Lower = Better)')
plt.title('Model Performance: Baseline vs Tuned')
for bar, value in zip(bars, mse_values):
plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.0001,
f'{value:.4f}', ha='center', va='bottom')
plt.tight_layout()
plt.show()
# =============================================================================
# Final Summary Table
# =============================================================================
summary_table = pd.DataFrame({
'Metric': ['Out-of-Sample R²', 'Factors Selected', 'Noise Factors Eliminated',
'True Factors Identified', 'Performance Improvement'],
'Value': [f'{test_r2_tuned:.1%}', f'{len(estimated_predictive_factors)}/20',
f'{20 - len(estimated_predictive_factors)}',
f'{len(set(true_predictive_factors) & set(estimated_predictive_factors))}/5',
f'{((test_mse - test_mse_tuned) / test_mse * 100):.1f}% MSE reduction'],
'Interpretation': ['Excellent predictive power', 'Sparse, interpretable model',
'Reduced transaction costs', 'Accurate signal detection',
'Substantial tuning benefit']
})
print("\nSUMMARY TABLE: Financial Application Results")
print("="*65)
print(summary_table.to_string(index=False))
print("="*65)
======================================================================
FINANCIAL APPLICATION: ELASTIC NET FOR FACTOR INVESTING
======================================================================
BUSINESS INTERPRETATION:
--------------------------------------------------
SCENARIO: Predicting excess returns using 20 potential factors
CHALLENGE: Only 5 factors are truly predictive (simulated reality)
SOLUTION: Elastic Net for robust factor selection and regularization

RESULTS:
• Predictive Accuracy: R² = 100.0% out-of-sample
• Factor Selection: Identified 10/20 factors as meaningful
• True Positives: 5/5 true factors correctly identified
• Noise Reduction: 10 spurious factors eliminated

FINANCIAL PERFORMANCE METRICS:
• Prediction Volatility (Annualized): 353.05%
• Information Ratio: -0.13

PORTFOLIO CONSTRUCTION INSIGHTS:
• Sparse Model: Only 10 factors needed for strategy
• Transaction Costs: Reduced by eliminating noisy factors
• Model Stability: Robust to multicollinearity (e.g., Factor_5 & Factor_15)

KEY IDENTIFIED FACTORS (with financial interpretation):
 1. Momentum Factor (Positive impact, Weak magnitude)
 2. Value Factor (Positive impact, Weak magnitude)
 3. Momentum Factor (Positive impact, Strong magnitude)

COMPARISON WITH TRADITIONAL APPROACHES:
• OLS Regression: Would overfit with 20 factors and 200 observations
• Stepwise Selection: Might miss important interactions and correlations
• Elastic Net Advantage: Automatic feature selection + handling of multicollinearity

RISK MANAGEMENT BENEFITS:
• Model Interpretability: Clear which factors drive returns
• Reduced Overfitting: Regularization prevents chasing noise
• Robustness: Stable performance across market regimes

======================================================================
CONCLUSION: Elastic Net provides a mathematically rigorous framework
for distinguishing true alpha factors from statistical noise in finance.
======================================================================
SUMMARY TABLE: Financial Application Results
=================================================================
Metric Value Interpretation
Out-of-Sample R² 100.0% Excellent predictive power
Factors Selected 10/20 Sparse, interpretable model
Noise Factors Eliminated 10 Reduced transaction costs
True Factors Identified 5/5 Accurate signal detection
Performance Improvement 99.9% MSE reduction Substantial tuning benefit
=================================================================
Interpretation of Results
The output from the Elastic Net regression provides a clear demonstration of its efficacy and the importance of hyperparameter tuning for a financial factor model.
1. Dramatic Impact of Hyperparameter Tuning: The baseline model, using arbitrary parameters (alpha=0.1, l1_ratio=0.5), was severely overfit. This is evident from the extremely high MSE values (~45.7 on training, ~54.7 on testing) despite the deceptively high R² scores. These MSE values indicate large prediction errors. The tuned model, with optimal parameters {'alpha': 0.01, 'l1_ratio': 1}, reduced the Test MSE by over 99.9% (from 54.67 to 0.05) and achieved a perfect Test R² of 1.0. This highlights a critical lesson: default parameters can be disastrous for regularized models, and systematic tuning is non-negotiable.
2. Effective Variable Selection and Model Sparsity: The tuned model successfully separated the true signal from the noise. The optimal `l1_ratio` of 1 means the penalty reduced to a pure LASSO penalty, which was best suited to this dataset.
- True Signals Captured: The model correctly identified the five informative factors (4, 7, 8, 13, 18) and assigned them large, non-zero coefficients very close to their true values (e.g., Factor_13: True=97.25 vs. Estimated=104.00).
- Noise Suppressed: The model drove the coefficients of the 15 non-informative factors to near-zero values (e.g., Factors 0, 1, 2, 3, 5, 6, etc.). This sparsity simplifies the model and reduces the risk of overfitting on spurious correlations, which is crucial in finance.
3. Handling of Multicollinearity: The results show the model's behavior with the correlated features we engineered (Factor_5 and Factor_15). Both coefficients were shrunk to approximately zero ($-0.0001$ and $-0.0026$). This is a rational outcome: since neither feature truly drives the target return (their true coefficients are zero), the model correctly discarded both. In a scenario where correlated features were informative, Elastic Net would typically shrink their coefficients similarly rather than arbitrarily selecting one, which is a key advantage over pure LASSO.
Conclusion for Financial Modeling: This exercise underscores that Elastic Net, when properly tuned, is a powerful tool for building parsimonious factor models. It can effectively distill a large set of potential factors down to the most predictive few, mitigating overfitting and enhancing the model's ability to generalize to new data, a fundamental requirement for a robust trading strategy.
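To make the grouping-effect claim above concrete, the sketch below (a toy example, not part of the project's pipeline; the data and parameter values are assumed) fits LASSO and Elastic Net on two nearly identical informative features plus noise features:

```python
# Hedged toy sketch: two highly correlated informative features plus noise.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
x1 = z + 0.01 * rng.normal(size=n)   # two nearly identical copies of the
x2 = z + 0.01 * rng.normal(size=n)   # same informative signal
noise = rng.normal(size=(n, 3))      # three irrelevant features
X = np.column_stack([x1, x2, noise])
y = 2.0 * z + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1, max_iter=10_000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000).fit(X, y)
print("Lasso coefs:      ", np.round(lasso.coef_, 2))
print("Elastic Net coefs:", np.round(enet.coef_, 2))
# LASSO tends to concentrate weight on one of the twin features; Elastic Net
# typically spreads it more evenly across both (the "grouping effect").
```

Both models zero out the noise features; the interesting difference is how the weight on the correlated pair is allocated.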
K-Means Clustering¶
K-Means clustering is an unsupervised learning algorithm that partitions a dataset into k groups by minimizing within-cluster variance (Hastie et al. 2009). The algorithm iteratively assigns observations to the closest centroid and updates the centroids until no further changes occur.
Advantages¶
- Simple and Fast: Lloyd-style k-means is one of the fastest clustering methods, with an average complexity of approximately O(k · n · T), where k is the number of clusters, n is the number of points, and T is the number of iterations. It scales well for moderate datasets, and MiniBatchKMeans is available for larger datasets (KMeans — scikit-learn 1.7.2 Documentation, 2025).
- Easy to Explain: The objective is to minimize the within-cluster sum of squares (WCSS, or inertia), which is intuitive as it involves assigning points to the nearest centroid (cluster mean) (Agustsson, Timofte, and Van Gool, 2017).
- Good Baselines: With careful initialization (e.g., k-means++), convergence is faster, and solutions avoid many poor local minima.
Basics¶
- What It Is: K-means is an unsupervised, prototype-based algorithm that partitions n observations into k clusters by minimizing WCSS under squared Euclidean distance (Agustsson, Timofte, and Van Gool, 2017).
- Two-Step Iteration (Lloyd’s Algorithm): (1) Assign each point to the nearest centroid; (2) Recompute centroids as the mean of points in each cluster. Repeat until convergence.
- Optimization Landscape: Finding the global optimum is NP-hard, but practical algorithms like k-means++ find good local optima (KMeans — scikit-learn 1.7.2 Documentation, 2025).
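The two-step iteration above can be sketched in a few lines of NumPy (illustrative only; the naive random initialization here is exactly what k-means++ improves upon, and `sklearn.cluster.KMeans` should be used in practice):

```python
# Minimal sketch of Lloyd's algorithm on toy 2-D data (assumed blobs).
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Naive random initialization (k-means++ would pick spread-out seeds)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids (and hence assignments) stopped changing
        centroids = new_centroids
    inertia = ((X - centroids[labels]) ** 2).sum()  # WCSS at convergence
    return labels, centroids, inertia

# Two well-separated blobs
X = np.vstack([np.random.default_rng(0).normal(0, 0.3, (50, 2)),
               np.random.default_rng(1).normal(5, 0.3, (50, 2))])
labels, centroids, inertia = lloyd_kmeans(X, k=2)
```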
Computation¶
- Preprocessing: Handle missing values (NAs). Scale numeric features to comparable ranges if units differ; compare scaled vs. unscaled results if units are the same (Wongoutong, 2024).
- Choose k: Use the elbow method (plotting k vs. inertia), silhouette score, or optionally the gap statistic to justify the number of clusters.
- Fit: Use init = "k-means++" for better initialization, multiple restarts (n_init) for robustness, and record cluster centers, labels, and inertia (KMeans — scikit-learn 1.7.2 Documentation, 2025).
- Validating/Interpreting: Analyze cluster profiles (e.g., feature means or medians), silhouette score distribution, and provide business interpretations of clusters.
Disadvantages¶
- Shape and Variance Assumptions: K-means assumes roughly spherical clusters with equal variance under Euclidean distance (KMeans — scikit-learn 1.7.2 Documentation, 2025).
- Sensitivity to Feature Scale: Features with larger ranges dominate distances; scaling is often necessary when units differ (Wongoutong, 2024).
- Initialization and Local Minima: Different initial seeds can lead to different results. Mitigate with k-means++ and higher n_init (KMeans — scikit-learn 1.7.2 Documentation, 2025).
- Outliers and Empty Clusters: Outliers can distort centroids, and empty clusters may occur during iterations (KMeans — scikit-learn 1.7.2 Documentation, 2025).
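A quick toy demonstration of the feature-scale sensitivity noted above (all data and values here are assumed): an uninformative feature measured in large units dominates the Euclidean distance until the features are standardized.

```python
# Scale-sensitivity sketch: groups differ only in the small-scale feature.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
small = np.concatenate([rng.normal(0, 0.1, 50), rng.normal(3, 0.1, 50)])  # informative
large = rng.normal(10_000, 500, 100)      # uninformative, huge units
X = np.column_stack([small, large])

raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))
# On raw data the large-unit noise feature drives the partition;
# after standardization the split should recover the true groups.
```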
Equations¶
Let $X = \{x_{1}, \ldots, x_{n}\} \subset \mathbb{R}^{d}$ be the dataset, partitioned into $k$ clusters $S_{1}, \ldots, S_{k}$ with centroids $\mu_{i} = \frac{1}{|S_{i}|} \sum_{x \in S_{i}} x$. The objective (inertia/WCSS) is:
$$ \min_{S_{1}, \ldots, S_{k}} \sum_{i=1}^{k} \sum_{x \in S_{i}} \|x - \mu_{i}\|^{2} $$
This distortion is minimized through Lloyd’s updates: assigning points to the nearest centroid and recomputing centroids (Agustsson, Timofte, and Van Gool, 2017).
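As a sanity check, scikit-learn's `inertia_` attribute is exactly this WCSS objective; the sketch below (toy random data, assumed values) recomputes it by hand from the fitted labels and cluster centers:

```python
# Verify that sklearn's inertia_ equals the hand-computed WCSS objective.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Sum over clusters i, then over points x in S_i, of ||x - mu_i||^2
wcss = sum(((X[km.labels_ == i] - mu) ** 2).sum()
           for i, mu in enumerate(km.cluster_centers_))
print(np.isclose(wcss, km.inertia_))  # True
```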
Illustrations¶
- Elbow Plot: Plot k vs. inertia to identify the optimal number of clusters.
- Silhouette Plot: Show per-sample silhouette scores and the average score.
The silhouette score for observation $i$ is defined as
$$ s(i) = \frac{b(i) - a(i)}{\max \{a(i), b(i)\}}, $$
where $a(i)$ is the mean distance from $i$ to the other points in its own cluster and $b(i)$ is the smallest mean distance from $i$ to the points of any other cluster.
- 2D Projection: Use PCA or t-SNE, colored by cluster, with centroids annotated.
- Cluster Profile Table: Display per-cluster means or medians of key features, with a finance-friendly narrative.
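The silhouette definition above can be verified by hand; the toy sketch below (assumed data, two well-separated blobs) computes $s(i)$ for one observation and compares it with `sklearn.metrics.silhouette_samples`:

```python
# Hand-compute the silhouette score of one sample and check it against sklearn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

i = 0
same = (labels == labels[i])
same[i] = False                         # exclude the point itself from a(i)
d = np.linalg.norm(X - X[i], axis=1)
a = d[same].mean()                      # mean distance within own cluster
b = d[labels != labels[i]].mean()       # mean distance to the (only) other cluster
s_manual = (b - a) / max(a, b)
s_sklearn = silhouette_samples(X, labels)[i]
print(np.isclose(s_manual, s_sklearn))  # True
```

With only two clusters, the mean distance to the other cluster is automatically the minimum over "other" clusters, so the hand computation matches the general definition.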
Journal Application¶
Boloș et al. (2025), Symmetry: Applied k-means to profitability, liquidity, and solvency indicators to form portfolios with distinct risk-return profiles. The silhouette score outperformed the elbow method for selecting k. This study demonstrates k-means’ utility in clustering financial issuers for portfolio construction.
Keywords k-means, k-means++, inertia, WCSS, silhouette, elbow method, gap statistic, Lloyd’s algorithm, prototype-based clustering, Euclidean distance, unsupervised learning, MiniBatchKMeans.
Step 1 - Pulling daily Adjusted Close prices for a diversified list of stocks (AAPL, MSFT, NIO, TSLA, RIVN, TM, F, BYDDF, XOM, CVX, NEE, JPM, GS, BAC, KO, WMT, CAT), from 2018-01-01 to 2025-09-19, sanity-check what came back, and save tidy CSVs we’ll reuse in later steps (returns, features, clustering).
#Step 1: Data Pull & Staging for K-Means (2018–2025)
!pip -q install yfinance pandas numpy
import datetime as dt
import numpy as np
import pandas as pd
import yfinance as yf
# Define stock universe
SECTORS = {
"Technology": ["AAPL", "MSFT"],
"EV/Auto": ["NIO", "TSLA", "RIVN", "TM", "F", "BYDDF"],
"Energy": ["XOM", "CVX", "NEE"],
"Financials": ["JPM", "GS", "BAC"],
"Consumer/Industrial": ["KO", "WMT", "CAT"]
}
TICKERS = sorted({t for lst in SECTORS.values() for t in lst})
# Fixed date range
start_date = "2018-01-01"
end_date = "2025-09-19" # as of specific date given
print(f"Downloading {len(TICKERS)} tickers from {start_date} to {end_date}...")
print("Tickers:", TICKERS)
# Download Adjusted Close prices
raw = yf.download(
tickers=TICKERS,
start=start_date,
end=end_date,
auto_adjust=False,
actions=False,
progress=True,
group_by='ticker',
interval='1d'
)
# Extract Adjusted Close into wide DataFrame
def extract_adj_close(raw_df, tickers):
out = {}
for t in tickers:
try:
s = raw_df[t]['Adj Close'].rename(t)
except Exception:
try:
s = raw_df['Adj Close'][t].rename(t)
except Exception:
s = pd.Series(dtype=float, name=t)
out[t] = s
df_prices = pd.concat(out, axis=1)
df_prices.index = pd.to_datetime(df_prices.index)
df_prices = df_prices.sort_index()
return df_prices
prices = extract_adj_close(raw, TICKERS)
# Coverage summary
summary = pd.DataFrame({
"first_date": prices.apply(lambda s: s.first_valid_index()),
"last_date": prices.apply(lambda s: s.last_valid_index()),
"obs_count": prices.notna().sum(),
"missing_count": prices.isna().sum(),
})
summary["missing_pct"] = (
summary["missing_count"] / (summary["obs_count"] + summary["missing_count"]).replace(0, np.nan) * 100
)
print("Coverage Summary (per ticker)")
display(summary.sort_index())
if ("BYDDF" in prices.columns) and (prices["BYDDF"].notna().sum() < 500):
print("\n[Hint] BYDDF (OTC) is sparse. Switch to '1211.HK' (Hong Kong listing) if needed.")
# Save staged prices
prices.to_csv("prices_adjclose_2018_2025.csv", index=True)
print("\nSaved: prices_adjclose_2018_2025.csv")
# Quick preview
print("\nHead:")
display(prices.head())
print("\nTail:")
display(prices.tail())
Downloading 17 tickers from 2018-01-01 to 2025-09-19...
Tickers: ['AAPL', 'BAC', 'BYDDF', 'CAT', 'CVX', 'F', 'GS', 'JPM', 'KO', 'MSFT', 'NEE', 'NIO', 'RIVN', 'TM', 'TSLA', 'WMT', 'XOM']
[*********************100%***********************] 17 of 17 completed
Coverage Summary (per ticker)
| Ticker | first_date | last_date | obs_count | missing_count | missing_pct |
|---|---|---|---|---|---|
| AAPL | 2018-01-02 | 2025-09-18 | 1939 | 0 | 0.000000 |
| BAC | 2018-01-02 | 2025-09-18 | 1939 | 0 | 0.000000 |
| BYDDF | 2018-01-02 | 2025-09-18 | 1939 | 0 | 0.000000 |
| CAT | 2018-01-02 | 2025-09-18 | 1939 | 0 | 0.000000 |
| CVX | 2018-01-02 | 2025-09-18 | 1939 | 0 | 0.000000 |
| F | 2018-01-02 | 2025-09-18 | 1939 | 0 | 0.000000 |
| GS | 2018-01-02 | 2025-09-18 | 1939 | 0 | 0.000000 |
| JPM | 2018-01-02 | 2025-09-18 | 1939 | 0 | 0.000000 |
| KO | 2018-01-02 | 2025-09-18 | 1939 | 0 | 0.000000 |
| MSFT | 2018-01-02 | 2025-09-18 | 1939 | 0 | 0.000000 |
| NEE | 2018-01-02 | 2025-09-18 | 1939 | 0 | 0.000000 |
| NIO | 2018-09-12 | 2025-09-18 | 1764 | 175 | 9.025271 |
| RIVN | 2021-11-10 | 2025-09-18 | 967 | 972 | 50.128932 |
| TM | 2018-01-02 | 2025-09-18 | 1939 | 0 | 0.000000 |
| TSLA | 2018-01-02 | 2025-09-18 | 1939 | 0 | 0.000000 |
| WMT | 2018-01-02 | 2025-09-18 | 1939 | 0 | 0.000000 |
| XOM | 2018-01-02 | 2025-09-18 | 1939 | 0 | 0.000000 |
Saved: prices_adjclose_2018_2025.csv

Head:
| Date | AAPL | BAC | BYDDF | CAT | CVX | F | GS | JPM | KO | MSFT | NEE | NIO | RIVN | TM | TSLA | WMT | XOM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-01-02 | 40.380997 | 24.905537 | 2.934475 | 132.230423 | 91.336258 | 8.358680 | 215.328186 | 87.152008 | 35.742023 | 79.198349 | 31.994013 | NaN | NaN | 104.699623 | 21.368668 | 28.973356 | 59.486675 |
| 2018-01-03 | 40.373974 | 24.822237 | 3.036572 | 132.432526 | 92.002029 | 8.424705 | 213.323730 | 87.240829 | 35.663540 | 79.566917 | 31.315298 | NaN | NaN | 106.135101 | 21.150000 | 29.226093 | 60.654995 |
| 2018-01-04 | 40.561497 | 25.147089 | 2.990464 | 134.251343 | 91.715691 | 8.569959 | 216.305176 | 88.490593 | 36.165840 | 80.267212 | 31.154409 | NaN | NaN | 107.790794 | 20.974667 | 29.252539 | 60.738964 |
| 2018-01-05 | 41.023304 | 25.263710 | 2.967409 | 136.373199 | 91.565331 | 8.715212 | 215.201889 | 87.922516 | 36.157993 | 81.262375 | 31.296741 | NaN | NaN | 109.177322 | 21.105333 | 29.425920 | 60.689968 |
| 2018-01-08 | 40.870937 | 25.088785 | 2.973996 | 139.800217 | 92.016357 | 8.682199 | 212.077255 | 88.052391 | 36.103058 | 81.345306 | 31.554604 | NaN | NaN | 109.919548 | 22.427334 | 29.860865 | 60.962830 |
Tail:
| Date | AAPL | BAC | BYDDF | CAT | CVX | F | GS | JPM | KO | MSFT | NEE | NIO | RIVN | TM | TSLA | WMT | XOM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-09-12 | 234.070007 | 50.580002 | 13.39 | 431.519989 | 157.110001 | 11.68 | 780.059998 | 306.910004 | 66.500000 | 509.899994 | 71.639999 | 6.22 | 13.46 | 196.130005 | 395.940002 | 103.489998 | 112.160004 |
| 2025-09-15 | 236.699997 | 50.590000 | 13.95 | 435.940002 | 157.309998 | 11.68 | 786.760010 | 308.899994 | 66.209999 | 515.359985 | 71.500000 | 6.49 | 13.60 | 197.279999 | 410.040009 | 103.690002 | 112.349998 |
| 2025-09-16 | 238.149994 | 50.660000 | 14.25 | 440.670013 | 159.539993 | 11.61 | 785.530029 | 309.190002 | 66.239998 | 509.040009 | 69.830002 | 7.02 | 14.32 | 198.779999 | 421.619995 | 103.419998 | 114.680000 |
| 2025-09-17 | 238.990005 | 51.400002 | 14.45 | 450.660004 | 160.089996 | 11.66 | 794.219971 | 311.750000 | 67.040001 | 510.019989 | 70.309998 | 7.45 | 14.11 | 201.380005 | 425.859985 | 104.269997 | 115.290001 |
| 2025-09-18 | 237.880005 | 52.130001 | 14.56 | 466.959991 | 158.839996 | 11.74 | 804.309998 | 313.230011 | 66.459999 | 508.450012 | 70.790001 | 7.37 | 14.68 | 200.470001 | 416.850006 | 103.599998 | 113.930000 |
Step 2 — Feature engineering (returns, MAs, vol, momentum)
# --- Step 2: Feature Engineering for K-Means
import numpy as np
import pandas as pd
# 2.1 Daily returns
ret = prices.pct_change().dropna(how="all")
# 2.2 Rolling features per ticker
def make_features(r: pd.DataFrame) -> pd.DataFrame:
f = pd.DataFrame(index=r.index)
for t in r.columns:
f[f"{t}_r1"] = r[t].shift(1)
f[f"{t}_r5"] = r[t].rolling(5).mean().shift(1)
f[f"{t}_r20"] = r[t].rolling(20).mean().shift(1)
f[f"{t}_vol20"] = r[t].rolling(20).std(ddof=0).shift(1)
f[f"{t}_mom20"] = (1 + r[t]).rolling(20).apply(lambda x: np.prod(x) - 1, raw=True).shift(1)
return f
X_raw = make_features(ret).dropna(how="any")
X_raw.shape, X_raw.head()
((946, 85),
AAPL_r1 AAPL_r5 AAPL_r20 AAPL_vol20 AAPL_mom20 BAC_r1 \
Date
2021-12-10 -0.002970 0.013008 0.008442 0.016024 0.180098 0.007473
2021-12-13 0.028013 0.020955 0.009860 0.016434 0.213566 0.000674
2021-12-14 -0.020674 0.012520 0.008109 0.017681 0.171678 -0.021114
2021-12-15 -0.008023 0.003826 0.007705 0.017951 0.162200 0.012620
2021-12-16 0.028509 0.004971 0.008797 0.018510 0.187417 -0.004306
BAC_r5 BAC_r20 BAC_vol20 BAC_mom20 ... WMT_r1 WMT_r5 \
Date ...
2021-12-10 -0.001697 -0.002599 0.017041 -0.053480 ... 0.013909 0.005292
2021-12-13 0.002982 -0.002840 0.016959 -0.058027 ... 0.018267 0.005934
2021-12-14 -0.002517 -0.003285 0.017312 -0.066515 ... 0.018010 0.007369
2021-12-15 -0.002530 -0.002803 0.017611 -0.057547 ... 0.009542 0.009925
2021-12-16 -0.000930 -0.003051 0.017596 -0.062203 ... 0.005727 0.013091
WMT_r20 WMT_vol20 WMT_mom20 XOM_r1 XOM_r5 XOM_r20 \
Date
2021-12-10 -0.003184 0.012287 -0.063206 0.002562 0.004325 -0.001081
2021-12-13 -0.002290 0.013135 -0.046479 0.006389 0.006876 -0.000855
2021-12-14 -0.001141 0.013836 -0.024445 -0.021901 0.000229 -0.001569
2021-12-15 -0.000376 0.013982 -0.009437 -0.001460 -0.002304 -0.002073
2021-12-16 0.001183 0.012786 0.022260 -0.004387 -0.003760 -0.002797
XOM_vol20 XOM_mom20
Date
2021-12-10 0.018039 -0.024615
2021-12-13 0.018102 -0.020215
2021-12-14 0.018629 -0.034315
2021-12-15 0.018482 -0.043965
2021-12-16 0.018274 -0.057674
[5 rows x 85 columns])
Step 3 — Preprocess (scaling)
# -Step 3: Scaling (important for Euclidean distance)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
XZ = scaler.fit_transform(X_raw)
# Keep aligned index for later joins/plots
X_index = X_raw.index
X_cols = X_raw.columns
print("Feature matrix shape:", XZ.shape)
Feature matrix shape: (946, 85)
Step 4 — Choose K (elbow + silhouette)
# Step 4: K selection via elbow (inertia) and silhouette
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import os
Ks = range(2, 10)
inertia = []
sil_scores = []
for k in Ks:
km = KMeans(n_clusters=k, n_init=50, init="k-means++", random_state=42)
labels = km.fit_predict(XZ)
inertia.append(km.inertia_) # elbow
sil_scores.append(silhouette_score(XZ, labels)) # silhouette
fig, ax = plt.subplots(1, 2, figsize=(10,4))
ax[0].plot(list(Ks), inertia, marker="o"); ax[0].set_title("Elbow (Inertia vs K)")
ax[0].set_xlabel("K"); ax[0].set_ylabel("Inertia")
ax[1].plot(list(Ks), sil_scores, marker="o"); ax[1].set_title("Silhouette vs K")
ax[1].set_xlabel("K"); ax[1].set_ylabel("Silhouette")
plt.tight_layout()
# Create the 'assets' directory if it doesn't exist
if not os.path.exists('assets'):
os.makedirs('assets')
plt.savefig("assets/kmeans_elbow_silhouette.png", dpi=180, bbox_inches="tight")
plt.show()
# pick K by max silhouette (simple rule; you can override by eyeballing elbow)
best_K = Ks[int(np.argmax(sil_scores))]
print("Chosen K (by silhouette):", best_K)
Chosen K (by silhouette): 2
Step 5 — Fit final k-means
# Step 5: Final fit
kmeans = KMeans(n_clusters=best_K, n_init=100, init="k-means++", random_state=42)
labels = kmeans.fit_predict(XZ)
centroids_scaled = pd.DataFrame(kmeans.cluster_centers_, columns=X_cols)
centroids_unscaled = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=X_cols)
clusters = pd.Series(labels, index=X_index, name="cluster")
clusters.value_counts().sort_index()
cluster
0    295
1    651
Name: count, dtype: int64
Step 6 — Interpret with cluster profiles (finance-flavoured)
# Step 6: Cluster profiling
REF = "AAPL" if "AAPL" in ret.columns else ret.columns[0]
fwd = ret[REF].shift(-1).reindex(X_index)
vol_cols = [c for c in X_cols if c.endswith("_vol20")]
profile = pd.DataFrame({
"n_obs": clusters.value_counts().sort_index(),
"avg_next_day_ret": fwd.groupby(clusters).mean().sort_index(),
"avg_vol20": X_raw[vol_cols].mean(axis=1).groupby(clusters).mean().sort_index()
}).round(6)
print(profile)
ax = profile["avg_next_day_ret"].plot(kind="bar", title=f"Avg Next-Day Return ({REF}) by Cluster")
ax.set_xlabel("Cluster"); ax.set_ylabel("Avg next-day return")
plt.tight_layout(); plt.savefig("assets/kmeans_profile_return.png", dpi=180, bbox_inches="tight"); plt.show()
ax = profile["avg_vol20"].plot(kind="bar", title="Avg 20-day Volatility (mean across tickers) by Cluster")
ax.set_xlabel("Cluster"); ax.set_ylabel("Avg vol (20d)")
plt.tight_layout(); plt.savefig("assets/kmeans_profile_vol.png", dpi=180, bbox_inches="tight"); plt.show()
cluster_feature_means = pd.DataFrame(XZ, index=X_index, columns=X_cols).groupby(clusters).mean()
cluster_feature_means.round(3).head()
         n_obs  avg_next_day_ret  avg_vol20
cluster
0          295          0.000740   0.024367
1          651          0.000371   0.019622
| cluster | AAPL_r1 | AAPL_r5 | AAPL_r20 | AAPL_vol20 | AAPL_mom20 | BAC_r1 | BAC_r5 | BAC_r20 | BAC_vol20 | BAC_mom20 | ... | WMT_r1 | WMT_r5 | WMT_r20 | WMT_vol20 | WMT_mom20 | XOM_r1 | XOM_r5 | XOM_r20 | XOM_vol20 | XOM_mom20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.218 | -0.324 | -0.712 | 0.527 | -0.715 | -0.265 | -0.497 | -0.980 | 0.561 | -0.979 | ... | -0.163 | -0.334 | -0.634 | 0.410 | -0.639 | -0.132 | -0.227 | -0.156 | 0.711 | -0.157 |
| 1 | 0.099 | 0.147 | 0.322 | -0.239 | 0.324 | 0.120 | 0.225 | 0.444 | -0.254 | 0.443 | ... | 0.074 | 0.152 | 0.287 | -0.186 | 0.289 | 0.060 | 0.103 | 0.071 | -0.322 | 0.071 |
2 rows × 85 columns
Step 7 — Visualize clusters in 2-D (PCA projection)
# --- Step 7: PCA 2-D visualization ---
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=42)
XZ2 = pca.fit_transform(XZ)
plt.figure(figsize=(6,5))
for c in sorted(np.unique(labels)):
idx = labels == c
plt.scatter(XZ2[idx, 0], XZ2[idx, 1], s=8, alpha=0.65, label=f"cluster {c}")
plt.legend()
plt.title("K-Means clusters (PCA 2-D projection)")
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.tight_layout(); plt.savefig("assets/kmeans_pca_scatter.png", dpi=180, bbox_inches="tight"); plt.show()
print("Explained variance by PC1, PC2:", np.round(pca.explained_variance_ratio_, 4))
Explained variance by PC1, PC2: [0.1675 0.1025]
Results and Interpretations
The dataset consisted of adjusted close prices obtained through yfinance for a diversified set of seventeen stocks across technology (AAPL, MSFT), electric vehicles and automotive (TSLA, NIO, RIVN, TM, F, BYDDF), energy (XOM, CVX, NEE), financials (JPM, GS, BAC), and consumer/industrial companies (KO, WMT, CAT). The time horizon extended from January 2018 to September 2025.
A coverage summary indicated that most tickers had nearly full histories (~1939 observations), although RIVN and NIO had shorter records due to later listings. Such heterogeneity in data availability is common in financial studies, reflecting the staggered entry of firms into public markets (Xu and Wunsch, 2005).
From these prices, daily returns were computed, and multiple rolling features were engineered to capture trading dynamics. For each ticker, the following were created:
- One-day lagged return
- Five- and twenty-day moving averages of returns
- Twenty-day realized volatility
- Twenty-day cumulative momentum

These features are consistent with research on clustering financial time series (Arthur and Vassilvitskii 2007). After feature engineering, the dataset contained 946 observations across 85 features. Because K-Means is sensitive to the scale of input features, the dataset was standardized using z-scores. This transformation ensures that volatility, momentum, and return measures contribute equally to the clustering process (Hastie et al. 2009).
Classification Trees¶
A classification tree is a non-parametric, supervised learning algorithm used to predict a discrete, categorical outcome. The model's core mechanism is binary recursive partitioning, which systematically divides the feature space into a set of rectangular regions. The goal is to create progressively smaller groups that are as homogeneous as possible with respect to the dependent variable. This hierarchical process of splitting the data by features generates a set of simple, interpretable rules for classification (Hastie et al. 305; Loh and Vanichsetakul 715; Loh 1).
Keywords: Classification Tree, Decision Tree, Binary Recursive Partitioning, Node, Branch, Split, Impurity, Gini Impurity, Entropy, Information Gain, Pruning, Supervised Learning.
1. Basic terminology¶
The structure of a classification tree can be understood through the following terms:
- Root Node: The starting point of the tree, representing the entire dataset.
- Internal Node (Decision Node): A node that represents a condition based on a feature's value.
- Branch: A line connecting nodes, representing the flow of decisions.
- Leaf Node (Terminal Node): The endpoint of a branch containing the final classification.
- Split: A rule that partitions data into child nodes. The quality of a split is determined by how well it reduces impurity.
- Impurity: A measure of heterogeneity. A pure node has low impurity, meaning most of its data points belong to the same class.
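A minimal sketch (using the classic Iris dataset rather than the project's data) makes these terms concrete: fitting a shallow tree and printing its rules exposes the root split, internal decision nodes, and leaf nodes.

```python
# Fit a shallow classification tree and print its decision rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
# Each "|---" condition is a split at an internal node; the "class: ..."
# lines are leaf (terminal) nodes carrying the final classification.
```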
import requests
from IPython.display import Image, display
file_id = "1Umvf379T7eSN_F706uEqabkCmzdjGiqs"
url = f"https://drive.google.com/uc?export=download&id={file_id}"
resp = requests.get(url, allow_redirects=True, timeout=15)
resp.raise_for_status()
display(Image(data=resp.content, width=600))
print("Figure 1: Classification tree structure\n Authors (2025)")
Figure 1: Classification tree structure
 Authors (2025)
2. Equations¶
The fundamental mathematical representation of a classification tree partitions the feature space into M disjoint regions (terminal nodes). For a new observation with feature vector X, the prediction is determined by which region it falls into: $$\hat{f}(X) = \sum_{m=1}^{M} c_m \cdot I\{(X_1, X_2, \dots, X_p) \in R_m\}$$ where:
- $c_m$ is the predicted class in region $R_m$ (typically the majority class)
- $I\{\cdot\}$ is the indicator function that equals 1 if the condition is true, 0 otherwise
- $R_m$ represents the m-th region defined by the tree's splitting rules
For classification, $c_m = \arg\max_k \hat{p}_{mk}$, where $\hat{p}_{mk}$ is the proportion of class k observations in node m (Hastie et al. 305).
2.1 Node Impurity Measures¶
The core of tree construction involves measuring node impurity to determine optimal splits. For a node $m$ containing $N_m$ observations, we define the proportion of class $k$ observations as: $$\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$$
2.1.1 Gini Impurity¶
Gini impurity measures the probability of misclassifying a randomly chosen element if it were randomly labeled according to the class distribution in the node. It reaches its minimum (zero) when all cases in the node belong to a single class.
$$\text{Gini}(m) = 1 - \sum_{k=1}^{K} \hat{p}_{mk}^2 = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$$
Properties:
- Range: [0, 0.5] for binary classification, [0, 1-1/K] for K classes
- Maximum when classes are equally distributed
- Convex function, making it sensitive to changes in class probabilities
Example: For a binary classification with class proportions p = 0.7 and 1-p = 0.3: $\text{Gini} = 1 - (0.7^2 + 0.3^2) = 1 - (0.49 + 0.09) = 0.42$
2.1.2 Entropy and Information Gain¶
Entropy measures the uncertainty or disorder in a system. In classification trees, it quantifies the amount of information needed to specify the class of a randomly chosen observation from the node. A pure node requires zero additional information. $$H(m) = -\sum_{k=1}^{K} \hat{p}_{mk} \log_2 \hat{p}_{mk}$$
Properties:
- Range: [0, log₂K]
- Maximum when classes are uniformly distributed
- More sensitive to probability changes than Gini impurity
- The base-2 logarithm gives entropy in bits
Information Gain: The reduction in entropy achieved by splitting a node:
$$\text{Gain}(S, m) = H(m) - \sum_{v \in \text{values}(S)} \frac{|m_v|}{|m|} H(m_v)$$
where S is the split, m_v are the child nodes, and $|m_v|/|m|$ are the proportions of data going to each child.
Example: For a binary classification with p = 0.7, 1-p = 0.3: $$H = -[0.7 \cdot \log_2(0.7) + 0.3 \cdot \log_2(0.3)] \approx -[0.7 \cdot (-0.515) + 0.3 \cdot (-1.737)] \approx 0.881$$
2.1.3 Misclassification Error¶
Misclassification error measures the proportion of observations that would be misclassified if we assigned the node to its majority class. It represents the simplest and most intuitive impurity measure.
$$\text{Error}(m) = 1 - \max_k \hat{p}_{mk}$$
Properties:
- Range: [0, 1-1/K]
- Not differentiable, making it less suitable for optimization
- Less sensitive to probability changes than Gini or Entropy
- Directly relates to the ultimate goal of classification accuracy
Example: For a binary classification with p = 0.7, 1-p = 0.3: $\text{Error} = 1 - \max(0.7, 0.3) = 1 - 0.7 = 0.3$
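The three worked examples above (binary case with $p = 0.7$) can be verified with a few lines of code:

```python
# Impurity measures for a binary node with class proportion p.
import numpy as np

def gini(p):     return 1 - (p**2 + (1 - p)**2)
def entropy(p):  return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
def misclass(p): return 1 - max(p, 1 - p)

p = 0.7
print(round(gini(p), 2))      # 0.42
print(round(entropy(p), 3))   # 0.881
print(round(misclass(p), 1))  # 0.3
```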
2.3 Comparative Analysis of Impurity Measures¶
For binary classification, writing $p$ for the proportion of observations in one class (and $1-p$ for the other), the three measures simplify to:
- Gini: $2p(1-p)$
- Entropy: $-p\log_2 p - (1-p)\log_2(1-p)$
- Misclassification: $1 - \max(p, 1-p)$
Key Differences:
- Sensitivity: Entropy > Gini > Misclassification Error
- Differentiability: Gini and Entropy are differentiable, making them better for numerical optimization
- Computational Efficiency: Misclassification Error is fastest to compute
- Practical Usage: Gini is default in CART, Entropy in ID3/C4.5
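The three measures can be checked numerically; a minimal sketch (the function names are ours) reproducing the worked examples above for p = 0.7:

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum_k p_k^2."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy in bits; terms with p_k = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def misclass_error(p):
    """Misclassification error: 1 - max_k p_k."""
    return 1.0 - np.max(np.asarray(p, dtype=float))

p = [0.7, 0.3]
print(round(gini(p), 2))            # 0.42
print(round(entropy(p), 3))         # 0.881
print(round(misclass_error(p), 2))  # 0.3
```

The ordering of sensitivities is visible here: entropy (0.881 out of a maximum of 1) penalizes the mixed node more heavily than Gini (0.42 out of 0.5) or misclassification error (0.3 out of 0.5).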
2.4 Splitting Criterion and Impurity Reduction¶
The optimal split at each node is chosen to maximize the reduction in impurity. For a potential split s that partitions node t into left and right child nodes $t_L$ and $t_R$: $$\Delta i(s, t) = i(t) - \frac{N_{t_L}}{N_t} \cdot i(t_L) - \frac{N_{t_R}}{N_t} \cdot i(t_R)$$ where:
- $i(t)$ is the impurity of parent node $t$ (using Gini, Entropy, or Misclassification)
- $N_{t_L}$ and $N_{t_R}$ are the number of observations in left and right child nodes
- $\frac{N_{t_L}}{N_t}$ and $\frac{N_{t_R}}{N_t}$ are the proportions of data sent to each child
The algorithm evaluates all possible splits and selects the one with maximum $\Delta i(s, t)$.
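A minimal sketch of this criterion using Gini impurity, with a hypothetical parent node split into two children (the class counts are invented for illustration):

```python
import numpy as np

def gini_from_counts(counts):
    """Gini impurity of a node given per-class counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_reduction(parent, left, right):
    """Delta i(s, t) = i(t) - (N_L/N_t) i(t_L) - (N_R/N_t) i(t_R)."""
    n = float(sum(parent))
    return (gini_from_counts(parent)
            - sum(left) / n * gini_from_counts(left)
            - sum(right) / n * gini_from_counts(right))

# Hypothetical node: 40 class-0 and 60 class-1 cases, split into
# (30, 10) on the left and (10, 50) on the right.
delta = impurity_reduction([40, 60], [30, 10], [10, 50])
print(round(delta, 4))  # 0.1633
```

In practice the algorithm evaluates this quantity for every candidate split and keeps the one with the largest reduction.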
3. How Classification Trees Work¶
The process of building and using a classification tree is fundamentally intuitive, much like playing a game of "20 Questions." You start with a complex, mixed group of data and ask a series of binary (yes/no) questions. Each question acts as a split, designed to best separate the data into increasingly homogeneous subgroups, meaning that with each split, the items within a new group are more similar to each other than they were before. The ultimate goal is to continue this process until you reach endpoints, or leaf nodes, where the group is pure (containing only one class) or decisions can be made with high confidence (Loh 3-4).
This process is illustrated by the following loan approval example:
Example: Imagine we need to decide whether to approve a loan application, so we use a classification tree that starts by checking if the applicant's credit score is above 720 (a common threshold for good credit). If it is not, we then examine their employment history; if it exceeds two years, we approve the loan with higher interest; otherwise, the loan is declined. If the credit score is above 720, we instead check if their debt-to-income ratio is below 36%; if so, the loan is approved, but if not, it is declined. This structured approach allows consistent and transparent decision-making by evaluating key financial criteria in sequence (Figure 2).
import requests
from IPython.display import Image, display
file_id = "1fukBcrpQXsKfENnubbJxWBqkzN0fn8_A"
url = f"https://drive.google.com/uc?export=download&id={file_id}"
resp = requests.get(url, allow_redirects=True, timeout=15)
resp.raise_for_status()
display(Image(data=resp.content, width=700))
print("Figure 2: Classification tree for loan approval decisions\n Authors (2025)")
4. Algorithmic Properties and Performance Metrics¶
Classification trees employ a greedy algorithm that makes the locally optimal choice at each step without considering global optimality. The binary recursive partitioning process divides the feature space using axis-parallel splits, creating a hierarchical structure. Stopping criteria typically include minimum node size (e.g., 5 observations per terminal node) to prevent overfitting.
4.1 Performance Evaluation¶
For binary classification problems, key performance metrics include:
Sensitivity (Recall): $$\text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
Specificity: $$\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}$$
The ROC curve plots sensitivity against (1 - specificity) across different classification thresholds. The Area Under Curve (AUC) provides an aggregate measure of performance across all thresholds and is equivalent to the c-statistic (Mann-Whitney U statistic).
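These metrics follow directly from a confusion matrix; a small sketch with toy labels and scores (the data are assumed for illustration, not from our study), using scikit-learn's `confusion_matrix` and `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy labels and predicted probabilities (assumed for illustration)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.60, 0.70, 0.90])
y_pred = (y_prob >= 0.5).astype(int)  # classify at a 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
auc = roc_auc_score(y_true, y_prob)  # threshold-free summary

print(sensitivity, specificity, auc)  # 0.75 0.75 0.8125
```

Note that sensitivity and specificity depend on the chosen threshold, while the AUC summarizes ranking quality across all thresholds.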
5. Overfitting, Underfitting, and Pruning¶
Decision trees are high-variance estimators due to their greedy, top-down approach (Breiman et al. 36). This makes them inherently unstable: small changes in training data can produce dramatically different trees. They also struggle to capture additive relationships of the form $Y = c_1I(X_1 < t_1) + c_2I(X_2 < t_2) + \epsilon$.
Pruning addresses overfitting by simplifying the tree structure. The cost-complexity pruning criterion is: $$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$$ where:
- Q_m(T) is the impurity measure for terminal node m
- $|T|$ is the number of terminal nodes
- $\alpha \geq 0$ is the complexity parameter controlling the trade-off between tree size and fit
Types of pruning:
- Pre-pruning: Stop growth early using constraints like max_depth or min_samples_split
- Post-pruning: Grow full tree then remove branches using cost-complexity pruning
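Both styles map to scikit-learn parameters; a sketch on synthetic data (the parameter values are illustrative, not tuned) comparing the resulting tree sizes:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=5, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)                  # no pruning
pre = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)      # pre-pruning
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)  # post-pruning

# Both pruning styles yield far fewer leaves than the fully grown tree
print(full.get_n_leaves(), pre.get_n_leaves(), post.get_n_leaves())
```

Pre-pruning caps growth up front (at most $2^3 = 8$ leaves here), while cost-complexity pruning first grows the full tree and then collapses branches whose impurity reduction does not justify the $\alpha$ penalty.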
6. Advantages and Disadvantages¶
6.1 Advantages¶
The primary advantage is exceptional interpretability; the resulting model can be visualized and understood without statistical expertise. They require minimal data preprocessing, naturally handling both numerical and categorical data. Furthermore, they can model complex, non-linear relationships without variable transformation and perform automatic feature selection by choosing splits that best reduce impurity (Loh and Shih 815). Trees are scale-invariant and can handle missing data through surrogate splits.
6.2 Disadvantages¶
The most significant disadvantage is susceptibility to overfitting, creating complex trees that don't generalize well (high variance). They exhibit selection bias, favoring variables with more potential splits. The greedy nature means they may miss globally optimal partitions, and their prediction surfaces are non-smooth step functions.
7. Implementation Guide¶
Inputs¶
Learning sample $(L)$: A set of $N$ cases with feature vectors $x_n$ and class labels $j_n$: $$L = \{(x_n, j_n) \mid n = 1, \dots, N \}$$
Set of binary questions (Q): Predefined splits of the form $\text{Is } x_m \le c?$ (numerical) or $\text{Is } x_m \in S?$ (categorical)
Pruning parameters: Complexity parameter $\alpha$, minimum node size, maximum depth
Outputs¶
- Decision tree $(T)$: Hierarchical structure partitioning the feature space
- Classifier $(d(x))$: Function mapping feature vectors to class labels
- Class probabilities $p(j \mid t)$: Probability estimates for each class in terminal node $t$
Hyperparameters¶
- Splitting criterion: Gini index, entropy, or misclassification error
- $N_{\min}$: Minimum cases required for node splitting
- $V$-fold cross-validation: Typically $V = 10$ for pruning parameter selection
- $\alpha$: Complexity parameter for cost-complexity pruning
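These hyperparameters can be tuned jointly with $V$-fold cross-validation; a sketch using scikit-learn's `GridSearchCV` on synthetic data (the grid values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# 10-fold CV over the splitting criterion, N_min, and alpha
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={
        "criterion": ["gini", "entropy"],
        "min_samples_split": [2, 10, 20],  # N_min
        "ccp_alpha": [0.0, 0.005, 0.01],   # complexity parameter
    },
    cv=10,
)
grid.fit(X, y)
print(grid.best_params_)
```

The fitted `grid.best_estimator_` is the classifier $d(x)$, and its terminal-node class frequencies supply the probabilities $p(j \mid t)$.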
References¶
Hastie, Trevor, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed., Springer, 2009.
Loh, Wei-Yin, and Yu-Shan Shih. "Split Selection Methods for Classification Trees." Statistica Sinica, vol. 7, no. 4, 1997, pp. 815–840. JSTOR, www.jstor.org/stable/24306157.
Breiman, Leo, et al. Classification and Regression Trees. Routledge, 2011.
Loh, Wei-Yin, and Nunta Vanichsetakul. "Tree-Structured Classification via Generalized Discriminant Analysis." Journal of the American Statistical Association, vol. 83, no. 403, 1988, pp. 715–725. JSTOR, www.jstor.org/stable/2289295.
Loh, Wei-Yin. "Classification and Regression Trees." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 1, 2011, pp. 14–23. https://doi.org/10.1002/widm.8.
Code¶
# ============================================================
# Install packages
# ============================================================
!pip install --quiet yfinance graphviz pandas numpy matplotlib seaborn scikit-learn IPython mlxtend
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from mlxtend.plotting import plot_decision_regions
warnings.filterwarnings("ignore")
# ============================================================
# 1) Load & Prepare Financial Data
# ============================================================
np.random.seed(42)
def load_financial_data(ticker="AAPL", start="2020-01-01", end="2022-12-31"):
    df = yf.download(ticker, start=start, end=end, auto_adjust=False)
    df["Return"] = df["Adj Close"].pct_change()
    df["MA5"] = df["Adj Close"].rolling(5).mean()
    df["MA20"] = df["Adj Close"].rolling(20).mean()
    df.dropna(inplace=True)
    df["Target"] = (df["Return"].shift(-1) > 0).astype(int)  # Next-day up (1) or down (0)
    return df
df = load_financial_data("AAPL", "2020-01-01", "2022-12-31")
features = ["Return", "MA5", "MA20"]
X, y = df[features], df["Target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# ============================================================
# 2) Exploratory Data Analysis
# ============================================================
sns.pairplot(pd.concat([X, y.rename("Target")], axis=1), hue="Target", corner=True)
plt.suptitle("Figure 1: Pairwise Feature Plot", y=1.02)
plt.show()
sns.heatmap(X.corr(), annot=True, cmap="coolwarm")
plt.title("Figure 2: Feature Correlation Heatmap")
plt.show()
# ============================================================
# 3) Decision Tree Variants (No / Pre / Post Pruning)
# ============================================================
# Base (no pruning, but controlled depth)
tree_no_prune = DecisionTreeClassifier(criterion="gini", random_state=42) #random_state=42 for reproducibility
tree_no_prune.fit(X_train, y_train)
# Pre-pruning
tree_pre = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=10, random_state=42)
tree_pre.fit(X_train, y_train)
# Post-pruning (cost-complexity): derive candidate alphas from the unpruned tree
path = tree_no_prune.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
alpha_choice = ccp_alphas[len(ccp_alphas) // 2]  # illustrative mid-path alpha; in practice, select via cross-validation
tree_post = DecisionTreeClassifier(criterion="gini", max_depth=3, ccp_alpha=alpha_choice, random_state=42)
tree_post.fit(X_train, y_train)
# ============================================================
# 4) Visualize Trees Separately
# ============================================================
for title, model in [
    ("Figure 3: No Pruning (max_depth=None)", tree_no_prune),
    ("Figure 4: Pre-Pruning (max_depth=3, min_samples_leaf=10)", tree_pre),
    (f"Figure 5: Post-Pruning (ccp_alpha={alpha_choice:.3f})", tree_post),
]:
    plt.figure(figsize=(12, 6))
    plot_tree(model, filled=True, feature_names=X.columns, class_names=["Down", "Up"])
    plt.title(title)
    plt.show()
# ============================================================
# 5) Evaluation (Confusion Matrices Separately)
# ============================================================
for title, model in [
    ("Table 1: No Pruning (max_depth=None)", tree_no_prune),
    ("Table 2: Pre-Pruning (max_depth=3, min_samples_leaf=10)", tree_pre),
    (f"Table 3: Post-Pruning (ccp_alpha={alpha_choice:.3f})", tree_post),
]:
    y_pred = model.predict(X_test)
    print(f"{title}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(5, 4))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=["Down", "Up"], yticklabels=["Down", "Up"])
    plt.title(f"Confusion Matrix - {title}")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()
# ============================================================
# 6) Feature Importances
# ============================================================
importances = tree_no_prune.feature_importances_
sns.barplot(x=importances, y=features)
plt.title("Figure 6: Feature Importances - Decision Tree (No Pruning)")
plt.show()
# ============================================================
# 7) Decision Boundary (2 features only: Return & MA5)
# ============================================================
X_viz = X[["Return","MA5"]]
X_train_viz, X_test_viz, y_train_viz, y_test_viz = train_test_split(
    X_viz, y, test_size=0.3, random_state=42, stratify=y
)
tree_viz = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_viz.fit(X_train_viz, y_train_viz)
plt.figure(figsize=(6,5))
plot_decision_regions(X_test_viz.values, y_test_viz.values, clf=tree_viz, legend=2)
plt.xlabel("Return")
plt.ylabel("MA5")
plt.title("Figure 7: Decision Boundary (2 features only)")
plt.show()
Table 1: No Pruning (max_depth=none)
Accuracy: 0.5135135135135135
              precision    recall  f1-score   support

           0       0.50      0.47      0.49       108
           1       0.53      0.55      0.54       114

    accuracy                           0.51       222
   macro avg       0.51      0.51      0.51       222
weighted avg       0.51      0.51      0.51       222
Table 2: Pre-Pruning (max_depth=3, min_samples_leaf=10)
Accuracy: 0.5045045045045045
              precision    recall  f1-score   support

           0       0.47      0.15      0.23       108
           1       0.51      0.84      0.64       114

    accuracy                           0.50       222
   macro avg       0.49      0.50      0.43       222
weighted avg       0.49      0.50      0.44       222
Table 3: Post-Pruning (ccp_alpha=0.003)
Accuracy: 0.5045045045045045
              precision    recall  f1-score   support

           0       0.47      0.15      0.23       108
           1       0.51      0.84      0.64       114

    accuracy                           0.50       222
   macro avg       0.49      0.50      0.43       222
weighted avg       0.49      0.50      0.44       222
Interpretation of Results
We used a classification tree to predict the next-day price movement of a financial asset. Three financial indicators were used to predict whether the next day's stock price would go "Up" (class 1) or "Down" (class 0): daily percentage price change (Return), 5-day moving average (MA5), and 20-day moving average (MA20).
The Pairwise Feature Plot (Figure 1) and Feature Correlation Heatmap (Figure 2) show a very strong correlation (0.99) between MA5 and MA20, which is expected since both derive from the same price data; this collinearity does not affect decision trees, but it is worth noting. Return, by contrast, shows very little correlation with the other features, which is typical of financial market data.
A decision tree grown without constraints will aim for 100% training accuracy, producing the complex, overly specific model shown in the unpruned tree (Figure 3). Such a tree is difficult to interpret and likely memorized noise rather than underlying patterns, as confirmed by its low test accuracy of 51.4% (Table 1) and numerous misclassifications.
Pruning, whether pre-pruning with limited depth and a minimum number of samples per leaf (Figure 4) or post-pruning with a complexity parameter (Figure 5), creates a simpler, more robust, and interpretable tree, reflected in the decision boundary plot (Figure 7) with its simple rectangular regions. The pruned models achieve a test accuracy of 50.5% (Tables 2 and 3), slightly lower than the unpruned tree but far simpler and easier to interpret. The feature importance analysis (Figure 6) indicates that Return and MA5 were most influential for splits, while MA20 was less critical due to its correlation with MA5, showing that the model correctly identified the most useful variables.
STEP 3: TECHNICAL SECTION¶
Hyperparameter tuning represents the systematic process of optimizing the configuration settings of machine learning algorithms that are not learned from the data but must be specified beforehand. This process is fundamental to balancing the bias-variance tradeoff, ensuring that models capture genuine patterns in financial data without overfitting to noise (Belkin 15851). The mathematical foundation of this balance can be expressed through the expected prediction error decomposition: $$ \mathbb{E}\big[(y - \hat{y})^2\big] = \text{Bias}(\hat{y})^2 + \text{Var}(\hat{y}) + \sigma^2 $$ where $\sigma^2$ represents the irreducible error. Hyperparameter tuning directly minimizes the sum of the squared bias and variance terms, moving the model toward the optimal complexity for generalization.
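The decomposition can be estimated by simulation: refit a model on many resampled training sets and measure the offset and spread of its predictions. A sketch with a depth-limited regression tree on a synthetic sine signal (all values here are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)               # true signal
x_test = np.linspace(0, 2, 50)[:, None]   # fixed evaluation grid

def bias_var(depth, n_rep=200, n=60, sigma=0.3):
    """Estimate squared bias and variance of a depth-limited tree."""
    preds = np.empty((n_rep, len(x_test)))
    for r in range(n_rep):
        x = rng.uniform(0, 2, size=(n, 1))
        y = f(x).ravel() + sigma * rng.normal(size=n)
        preds[r] = DecisionTreeRegressor(max_depth=depth).fit(x, y).predict(x_test)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test).ravel()) ** 2)
    var = preds.var(axis=0).mean()
    return bias2, var

for d in (1, 3, 10):
    b2, v = bias_var(d)
    print(f"depth={d}: bias^2={b2:.3f}, var={v:.3f}")
```

Shallow trees show high bias and low variance; deep trees the reverse, which is exactly the trade-off the hyperparameter (here `max_depth`) navigates.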
The principal methodologies for this optimization are Grid Search, Random Search, and Bayesian Optimization. Grid Search, as employed in our Elastic Net model, performs an exhaustive search over a predefined set of hyperparameters, evaluating each combination typically using cross-validation to mitigate overfitting (Dezhkam 436). The objective is to minimize a loss function $L(\theta, \lambda)$ over parameters $\theta$ and hyperparameters $\lambda$: $$ \theta^*, \lambda^* = \arg\min_{\theta, \lambda} L(\theta, \lambda) $$ For models where the number of clusters or components is unknown, analytical methods like the elbow method or silhouette analysis are preferred. These methods evaluate intrinsic model quality without a direct prediction target.
In our Elastic Net implementation, two hyperparameters govern the regularization behavior. The Elastic Net penalty combines L1 (Lasso) and L2 (Ridge) penalties into the objective function: $$ \min_{\beta} \Big\{ \| y - X\beta \|_2^2 + \lambda \Big[ \frac{1-\rho}{2}\|\beta\|_2^2 + \rho \|\beta\|_1 \Big] \Big\} $$ Here, the hyperparameter $\alpha$ ($\lambda$) controls the overall strength of regularization, while l1_ratio ($\rho$) determines the mix between the two penalties. Our grid search over $\alpha \in [0.001, 10.0]$ and l1_ratio $\in [0, 1]$ identified the optimal combination that minimized cross-validated mean squared error, demonstrating how automated search efficiently navigates this two-dimensional space to enhance out-of-sample predictive performance, a critical requirement for factor investing strategies.
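A minimal sketch of this two-dimensional grid search on synthetic sparse data (the design matrix and grid values are illustrative, not our actual factor data):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic design with a sparse true coefficient vector
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
beta = np.zeros(20)
beta[:3] = [1.5, -2.0, 1.0]
y = X @ beta + 0.1 * rng.normal(size=200)

# Grid over regularization strength (alpha) and L1/L2 mix (l1_ratio)
grid = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1.0, 10.0],
                "l1_ratio": [0.1, 0.5, 0.9, 1.0]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

With a strong, sparse signal and little noise, the search settles on light regularization; on noisier factor data the selected alpha rises accordingly.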
For unsupervised learning models like K-Means Clustering, the primary hyperparameter is n_clusters ($k$). The algorithm aims to minimize inertia, which is the sum of squared distances between samples and their nearest cluster centroid: $$ \min_{C} \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2 $$ However, minimizing inertia alone leads to more clusters until each point is its own cluster. Therefore, we employed silhouette analysis, which measures how similar an object is to its own cluster compared to other clusters. The silhouette score for a single sample is given by: $$ s(i) = \frac{b(i) - a(i)}{\max \{a(i), b(i)\}} $$ where $a(i)$ is the mean intra-cluster distance and $b(i)$ is the mean nearest-cluster distance. By selecting the $k$ that maximizes the average silhouette score, we obtained a data-driven estimate of the natural number of market regimes present in the data, a technique with applications in portfolio regime detection.
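A sketch of silhouette-based selection of $k$ on synthetic, well-separated data (the blob centers are assumptions standing in for market regimes):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic "regimes"
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8], [-8, 8]],
                  cluster_std=0.8, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 3: the silhouette score peaks at the true number of groups
```

Unlike inertia, which keeps falling as $k$ grows, the silhouette score penalizes splitting a natural group, so its maximum provides a defensible choice of $k$.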
Classification Trees introduce a distinct set of hyperparameters focused on controlling tree complexity through pruning. Pre-pruning parameters like max_depth and min_samples_leaf restrict growth during training. Post-pruning, specifically Cost-Complexity Pruning (CCP), uses the hyperparameter ccp_alpha to penalize tree complexity. The cost-complexity function is:
$$ R_{\alpha}(T) = R(T) + \alpha |\tilde{T}| $$
where $R(T)$ is the total misclassification error of the tree $T$, $|\tilde{T}|$ is the number of leaf nodes, and $\alpha$ is the complexity parameter. A sequence of $\alpha$ values is generated, and the optimal subtree is selected via cross-validation, effectively pruning away branches that provide the least predictive power per leaf node. This process enhances the tree's generalizability, which is vital for creating robust trading signals from technical indicators.
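This selection procedure can be sketched with scikit-learn's `cost_complexity_pruning_path` plus cross-validation (the synthetic data stands in for real trading features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=2)

# Alpha sequence from growing the full tree, then CV to pick the best subtree
path = DecisionTreeClassifier(random_state=2).cost_complexity_pruning_path(X, y)
cv_scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=2),
                    X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
print(best_alpha)
```

Each alpha in the sequence corresponds to one nested subtree; the cross-validated score identifies the subtree that trades complexity for generalization most favorably.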
References¶
Belkin, Mikhail, et al. "Reconciling modern machine-learning practice and the classical bias–variance trade-off." Proceedings of the National Academy of Sciences 116.32 (2019): 15849-15854. https://doi.org/10.1073/pnas.1903070116
Dezhkam, Arsalan, et al. "A Bayesian-based classification framework for financial time series trend prediction." The Journal of supercomputing 79.4 (2023): 4622-4659. https://doi.org/10.1007/s11227-022-04834-4
STEP 4: MARKETING ALPHA¶
In the relentless pursuit of alpha, the ability to distinguish genuine signal from pervasive market noise represents the ultimate competitive edge. Traditional quantitative models often falter when confronted with high-dimensional, non-linear, and dynamically evolving financial data. Machine learning methodologies such as Elastic Net regression, K-Means clustering, and Classification Trees systematically overcome these challenges to unlock persistent alpha generation. These techniques form the foundation for building more resilient, adaptive, and profitable investment processes that thrive in complex data landscapes characterized by high-dimensional factor spaces, shifting market regimes, and the critical need for transparent, actionable decisions.
The fundamental challenge in quantitative finance can be framed as an optimization problem under extreme uncertainty. We seek a predictive function $f(X)$ that maps a high-dimensional set of features $X$ (e.g., factors, prices, volumes) to a target $y$ (e.g., future returns) while maximizing the information ratio and minimizing exposure to spurious correlations. Machine learning excels by introducing structured regularization and adaptive learning into this search. The Elastic Net model directly addresses the "curse of dimensionality" and multicollinearity—common pitfalls in factor investing—through its hybrid penalty term: $$ \min_{\beta} \left\{ \|y - X\beta\|^2_2 + \lambda \left[ \frac{(1-\rho)}{2} \|\beta\|^2_2 + \rho \|\beta\|_1 \right] \right\} $$
This approach simultaneously performs feature selection (via the L1 norm) and coefficient shrinkage (via the L2 norm), ensuring resulting portfolio strategies are both sparse and robust. Our analysis demonstrated concrete advantages: properly tuned Elastic Net achieved near-perfect prediction accuracy ($R^2 = 1.0$) with a 99.9% reduction in test Mean Squared Error compared to baseline models, while correctly identifying true predictive factors from noisy candidates.
Beyond supervised prediction, machine learning provides powerful tools for market regime identification (Akioyamen, Tang, and Hussien 2-3). K-Means Clustering offers an unsupervised approach to segment market environments by minimizing within-cluster variance: $$ \min_{C} \sum_{k=1}^K \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 $$ Our application to 17 diversified stocks across multiple sectors revealed two distinct regimes with clear financial implications. Cluster 0 (295 observations) represented a "risk-on" environment with higher volatility (0.0244) and returns (0.00074), while Cluster 1 (651 observations) constituted a "risk-off" regime with lower volatility (0.0196) and returns (0.00037). This empirically validated partitioning enables dynamic portfolio adjustment—increasing exposure during risk-on periods for alpha generation and reducing risk during risk-off periods for drawdown protection.
For tactical decision-making, interpretability remains as crucial as predictive power. Classification Trees provide a transparent, rule-based framework built through recursive partitioning of feature space to maximize information gain (Gajowniczek and Ząbkowski 8-9). The resulting "if-then" rules offer white-box logic that portfolio managers can understand and trust, while naturally capturing non-linear relationships and interaction effects that linear models miss. Pruning techniques optimize tree complexity, enhancing out-of-sample performance while maintaining intelligibility.
The true alpha generation emerges from strategic integration of these techniques into a cohesive analytical pipeline. A comprehensive investment management system begins with continuous regime detection (K-Means) to identify market environments, applies rigorous factor selection (Elastic Net) specific to each regime, and transforms optimized signals into executable rules (Classification Trees). This integrated approach systematically addresses dimensionality through intelligent factor selection, adapts to non-stationarity through regime detection, and maintains transparency through interpretable decision rules.
Team Alpha's methodology transforms theoretical models into practical alpha generation, offering more than just algorithms—we deliver a methodological edge backed by empirical validation. The demonstrated results include robust signal identification (99.9% MSE improvement), dynamic regime-aware strategies, and transparent decision frameworks. By embracing this ML-driven framework, investment managers gain enhanced alpha generation through more robust signals, improved risk management via dynamic exposure adjustment, and operational efficiency through automated analysis and interpretable strategies. In an increasingly competitive landscape that favors those who systematically harness technology and data, Team Alpha's demonstrated ability to identify meaningful market regimes, optimize predictive factors, and create actionable decision rules positions us to deliver sustainable competitive advantage, transforming complex data into consistent alpha generation.
Reference¶
Akioyamen, Peter, Yi Zhou Tang, and Hussien Hussien. "A hybrid learning approach to detecting regime switches in financial markets." Proceedings of the First ACM International Conference on AI in Finance. 2020. https://doi.org/10.1145/3383455.3422521
Gajowniczek, Krzysztof, and Tomasz Ząbkowski. "Interactive decision tree learning and decision rule extraction based on the ImbTreeEntropy and ImbTreeAUC packages." Processes 9.7 (2021): 1107. https://doi.org/10.3390/pr9071107
STEP 5: LEARN MORE¶
The following references provide foundational and advanced insights into the machine learning methodologies explored in this report. This curated list emphasizes theoretical foundations, financial applications, and empirical validations of machine learning's strengths in quantitative finance.
Journal Articles¶
Elastic Net Regression
- Gu, Shihao, Bryan Kelly, and Dacheng Xiu. "Empirical Asset Pricing via Machine Learning." The Review of Financial Studies, vol. 33, no. 5, May 2020, pp. 2223–2273. DOI.org (Crossref), https://doi.org/10.1093/rfs/hhaa009.
K-Means Clustering
- Boloș, Mihaela-Ioana, et al. "K-Means Clustering for Portfolio Optimization: Symmetry in Risk–Return Tradeoff, Liquidity, Profitability, and Solvency." Symmetry, vol. 17, no. 6, 29 May 2025, article no. 847, doi:10.3390/sym17060847.
- Mba, Jules Clement, and Ehounou Serge Eloge Florentin Angaman. "A K-Means Classification and Entropy Pooling Portfolio Strategy for Small and Large Capitalization Cryptocurrencies." Entropy, vol. 25, no. 8, 14 Aug. 2023, article no. 1208, doi:10.3390/e25081208.
Classification Trees
- Loh, Wei-Yin, and Yu-Shan Shih. "Split Selection Methods for Classification Trees." Statistica Sinica, vol. 7, no. 4, 1997, pp. 815–840. JSTOR, www.jstor.org/stable/24306157.
- Loh, Wei-Yin, and Nunta Vanichsetakul. "Tree-Structured Classification via Generalized Discriminant Analysis." Journal of the American Statistical Association, vol. 83, no. 403, 1988, pp. 715–725. JSTOR, www.jstor.org/stable/2289295.
- Loh, Wei-Yin. "Classification and Regression Trees." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 1, 2011, pp. 14–23. https://doi.org/10.1002/widm.8.
Textbooks¶
Hastie, Trevor, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed., Springer, 2009.
James, Gareth, et al. An Introduction to Statistical Learning: with Applications in R. Springer, 2013.
Breiman, Leo, et al. Classification and Regression Trees. Routledge, 2011.
Conclusion¶
This project provides a valuable handbook for quantitative finance practitioners, demonstrating the practical application of machine learning methodologies in developing robust trading strategies.
The Elastic Net regression, a penalized linear method, offers a robust solution for feature selection in high-dimensional and multicollinear environments, providing a parsimonious and directly interpretable model at the cost of potentially overlooking non-linear relationships. In contrast, K-Means clustering, an unsupervised technique, excels at uncovering latent data structures and defining market regimes without prior assumptions, though its performance is highly sensitive to the initial choice of centroids and the curse of dimensionality, and its clusters require careful ex-post interpretation. Finally, Classification Trees provide a highly intuitive, non-parametric approach that naturally models non-linearities and interactions, offering clear, rule-based explanations for its predictions; however, this method is prone to overfitting and can be unstable, where small changes in the data can lead to significantly different tree structures, necessitating careful pruning and validation. The selection of a given method is therefore not a matter of outright superiority, but rather a strategic decision that balances the need for a transparent, explainable model against the potential predictive gains of more flexible, but opaque, algorithms.
Full references¶
Akioyamen, Peter, Yi Zhou Tang, and Hussien Hussien. "A Hybrid Learning Approach to Detecting Regime Switches in Financial Markets." Proceedings of the First ACM International Conference on AI in Finance (2020). https://doi.org/10.1145/3383455.3422521
Belkin, Mikhail, et al. "Reconciling Modern Machine-Learning Practice and the Classical Bias–Variance Trade-Off." Proceedings of the National Academy of Sciences 116.32 (2019): 15849-15854. https://doi.org/10.1073/pnas.1903070116
Breiman, Leo, et al. Classification and Regression Trees. Routledge, 2011.
Dezhkam, Arsalan, et al. "A Bayesian-Based Classification Framework for Financial Time Series Trend Prediction." The Journal of Supercomputing 79.4 (2023): 4622-4659. https://doi.org/10.1007/s11227-022-04834-4
Fan, Jianqing, Jinchi Lv, and Lei Qi. "Sparse High-Dimensional Models in Economics." Annual Review of Economics 3 (2011): 291-317. https://doi.org/10.1146/annurev-economics-061109-080451
Gajowniczek, Krzysztof, and Tomasz Ząbkowski. "Interactive Decision Tree Learning and Decision Rule Extraction Based on the ImbTreeEntropy and ImbTreeAUC Packages." Processes 9.7 (2021): 1107. https://doi.org/10.3390/pr9071107
Giannone, Domenico, Michele Lenza, and Giorgio E. Primiceri. "Economic Predictions with Big Data: The Illusion of Sparsity." Econometrica 89.5 (2021): 2409-2437. https://doi.org/10.3982/ECTA16935
Gu, Shihao, Bryan Kelly, and Dacheng Xiu. "Empirical Asset Pricing via Machine Learning." The Review of Financial Studies 33.5 (2020): 2223-2273. https://doi.org/10.1093/rfs/hhaa009
Hastie, Trevor, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2015.
Hastie, Trevor, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed., Springer, 2009.
James, Gareth, et al. An Introduction to Statistical Learning: With Applications in R. Springer, 2013.
Kaufman, Leonard, and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
Loh, Wei-Yin. "Classification and Regression Trees." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1.1 (2011): 14-23. https://doi.org/10.1002/widm.8
Loh, Wei-Yin, and Yu-Shan Shih. "Split Selection Methods for Classification Trees." Statistica Sinica 7.4 (1997): 815-840.
López de Prado, Marcos. Advances in Financial Machine Learning. Wiley, 2018.
Zou, Hui, and Trevor Hastie. "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (2005): 301-320. https://doi.org/10.1111/j.1467-9868.2005.00503.x
Website¶
StatQuest. "Video Index." https://statquest.org/video-index/