Introduction¶

Machine learning (ML) has evolved from a competitive advantage to a basic necessity in quantitative finance. However, the path from a theoretical model to a sound, effective strategy is fraught with methodological pitfalls. Selecting a powerful algorithm is not enough; its learning process must be carefully engineered for it to succeed. This project goes beyond a straightforward comparison of classifiers to address three important problems in the machine learning pipeline: hyperparameter optimization, generalization error management, and the strategic combination of models.

We start by discussing the crucial topic of hyperparameter optimization (HPO). Although hyperparameters have a significant impact on an ML model's performance, practitioners frequently use less-than-ideal defaults. From the laborious Grid Search to the more effective Random Search (Bergstra & Bengio 285) to the cutting-edge Bayesian Optimization (Frazier 1), which employs probabilistic models to direct the search for ideal configurations, we methodically assess a range of HPO approaches. This comparison offers a guide for effectively negotiating intricate parameter spaces in order to realize the full potential of a model.

Secondly, we address the fundamental statistical conundrum of the Bias-Variance Tradeoff. Overfitting occurs when a model that works flawlessly on training data performs disastrously on fresh data. We break down the formal decomposition of generalization error and show how regularization techniques can help manage model complexity (Hastie, Tibshirani & Friedman). We also examine the contemporary "double descent" phenomenon (Belkin et al., 15853), which contradicts classical theory by demonstrating the perfect generalization of highly over-parameterized models, like contemporary neural networks, thereby balancing theory and empirical success.

Lastly, we look into the potential for synergy in ensemble learning. We illustrate that combining predictions from several algorithms, including Support Vector Machines (SVM), Linear Discriminant Analysis (LDA), and Neural Networks (NN), can produce better and more consistent performance than depending on just one model. Following the development of the fundamentals of boosting (e.g., XGBoost) and bagging (e.g., Random Forests), we suggest and put into practice an advanced Stacking (Stacked Generalization) framework (Wolpert, 253).

Optimizing Hyperparameters¶

Hyperparameters are configurations that control the learning process and model architecture. They are fixed when the training begins. Hyperparameters are not learnt from the data in contrast to model parameters. But they are hugely important for how good a model is and how much it generalizes (James et al.).

Using the default hyperparameters in financial machine learning would correspond to employing a trading algorithm that fits all. It may work, in some instances, but it's not able to capture the set of unique features found in market data. Hyperparameters are critical constraints that influence the behavior of a model. We demonstrate that model-specific optimization methodologies intended to be only conceptual are in fact an explicit alpha generator and a necessary bullet against overfitting in evolving financial markets.

Technical Section¶

The performance that machine learning models achieve in financial application is dominated by their hyperparameter settings. Defined prior to training, these parameters describe the model's architecture and influence its learning dynamics as well as generalization. Formal Objectives The optimization expression to be minimized takes the following form:

$$ \theta^* = \arg\min_{\theta \in \Theta} \mathbb{E}_{(X,y) \sim \mathcal{D}}[\mathcal{L}(f_\theta(X), y)], $$

where $\mathcal{D}$ is the data distribution, $\mathcal{L}$ is the loss function that measures prediction error, $\theta$ denotes hyperparameters, and $\Theta$ defines the hyperparameter space.

Model-Specific Hyperparameter Spaces¶

Support Vector Machines (SVM)

The performance of the SVM is highly sensitive to three hyperparameters:

  • Regularization Parameter ($C$): Balances the margin maximization and the minimization of training error:

    • High $C$: Narrow margin, more non-linear boundary, risk of overfitting
    • Low $C$: wider margin, simpler boundary generalizes better on noisy financial data
  • Kernel Function: This is the function that specifies the transformation of feature space:

    • Linear: $K(x_i, x_j) = x_i^T x_j$
    • Radial basis function (RBF): $K(x_i, x_j) = \exp(-\gamma ||x_i - x_j||^2)$
    • Polynomial: $K(x_i, x_j) = (x_i^T x_j + r)^d$
  • Kernel Coefficient ($\gamma$): regulates the training examples' influence radius for the RBF kernel:

    • High $\gamma$: Localized, intricate decision boundaries
    • Low $\gamma$: Generalized, smooth boundaries

Neural Networks (NN)

The NN hyperparameter space encompasses architecture, optimization, and regularization:

  • Architectural Parameters: Model capacity is defined by the number of layers $L$ and neurons per layer $n_l$.

    $$ \text{Capacity} \propto \sum_{l=1}^L n_l \cdot n_{l-1} $$

Optimization Parameters: The gradient descent step size is controlled by the learning rate $\eta$.

$$ w_{t+1} = w_t - \eta \nabla \mathcal{L}(w_t) $$

  • Regularization Parameters - L1/L2 coefficients $\lambda$ and dropout rate $p$ guard against overfitting:

    $$ \mathcal{L}_{reg} = \mathcal{L}_{data} + \lambda ||w||_p $$

LDA, or linear discriminant analysis

Regarding financial applications where there may be $p > n$ scenarios: Shrinkage Parameter ($\alpha$): Regularizes the estimation of covariance matrices:

$$ \Sigma_{shrink} = (1-\alpha)\Sigma + \alpha\frac{\text{tr}(\Sigma)}{p}I $$

The choice of numerical algorithm (eigen, lsqr, or svd) has an impact on efficiency and stability.

Optimization Methodologies¶

Grid Search Comprehensive search over grids of predefined parameters with computational complexity $O(m^d)$:

$$ \theta^* = \arg\min_{\theta \in \{\theta_1 \times \theta_2 \times \cdots \times \theta_d\}} \frac{1}{K}\sum_{k=1}^K \mathcal{L}_{val}^{(k)}(f_\theta) $$

Hyperparameters from probability distributions are sampled by Random Search:

$$\theta_i \sim P(\theta), \quad i = 1, 2,..., N$$

shown to be more effective in high-dimensional spaces than grid search.

Gaussian processes are used in Bayesian Optimization to model the objective function:

$$ f(\theta) \sim \mathcal{GP}(\mu(\theta), k(\theta, \theta')) $$

balances exploration and exploitation through the use of acquisition functions: Anticipated Enhancement:

$$ \alpha_{EI}(\theta) = \mathbb{E}[\max(0, f(\theta) - f(\theta^+))] $$

Upper Confidence Bound:

$$ \alpha_{UCB}(\theta) = \mu(\theta) + \kappa\sigma(\theta) $$

Financial Validation Procedure¶

Temporal cross-validation is crucial due to the non-stationarity of financial data:

$$ \mathcal{L}_{TSCV} = \frac{1}{T}\sum_{t=1}^T \mathcal{L}(f_\theta(X_{[t-w:t]}), y_{[t+1]}) $$

where $T$ is the number of time splits and $w$ is the size of the rolling window.

Hyperparameter Comparisons

Model Hyperparameter Function in Financial Task Ideal Range/Value
SVM $C$ (Regularization) Overfitting to market noise is balanced with model complexity. $[10^{-3}, 10^3]$
$\gamma$ (Kernel Width) Indicates a single data point's influence range; this is important for localizing patterns. $[10^{-4}, 10]$
$kernel$ Identifies non-linear relationships in the data. RBF (for intricate patterns)
Neural Network Learning Rate Regulates step size for convergence; crucial for training stability. $[10^{-5}, 10^{-1}]$
Layers / Neurons Assesses the model's ability to learn market complexity. 1–5 layers; 16–512 neurons/layer
Dropout Prevents overfitting to temporal artifacts by randomly turning off neurons. $[0.1, 0.5]$
LDA shrinkage Maintains covariance estimates in high-dimensional data. $[0, 1]$
solver Assures numerical stability in the covariance matrix computation. 'svd' (when features > samples)

Non-Technical Section¶

In constructing financial forecasting models, a primary challenge lies in calibrating models to achieve both high predictive performance and robustness, where robustness refers to the model's ability to maintain stable performance amidst the inherent non-stationarity and volatility characteristic of financial markets. Hyperparameter optimization serves as a systematic framework for adapting model specifications to diverse market regimes, thereby enhancing generalizability and mitigating the risks associated with overfitting to particular historical conditions.

Each predictive model has different types of logical control settings that dictate how it behaves, and these are referred to as hyperparameters. They control:

  • The most complex the model can get
  • How quickly the model should learn from historical data
  • How the model navigates the trade-off between accurately learning genuine market dynamics and avoiding the incorporation of random noise. While some hyperparameter configurations may prioritize noise reduction and enhance stability, they risk failing to fully capture meaningful market signals, thereby limiting both predictive performance and potential benefits for risk management.

Our Systematic Three-Phase Optimization Process We use hard and systematic treatment to determine the best setting:

Phase 1: Definition of Strategic Parameter Space The procedure is as follows: for each meta-model type, first of all the critical control parameters are identified:

  • Sensitivity to patterns in the market and resistance to noise is adjusted for pattern recognition-based models (e.g. support vector machines (SVM).
  • We study network depth,learning speed, and whether to have built-in risk controls in the context of neural networks.
  • Stability parameters in statistical models are set for high-module-but-low-stability financial data.

Phase 2: Intelligent Optimization Search The testing of different setups is done using advance optimization methods:

  1. Wide Scanning: The scan is performed in the wide zone to find regions of interest.
  2. Focused refinement: The most promising regions are then considered in greater detail using smart algorithms to find the best combination of parameters.
  3. Cross-Validation: We carry out the experiments on multiple historical periods to test each configuration.

Phase 3: Financial Reality Validation All setups are checked with a time-aware test procedure which respects the special nature of financial markets:

  • Model validation is done in bull, bear and flat markets.
  • Stability can be examined during periods of robust volatility and markets calamities.
  • It is demonstrated that model performance is not reliant on certain historical artifacts.

There are also some concerns cited by strategists that the optimization process tackles directly:

Answering the Critical Questions The strategists' concerns are explicitly addressed by our optimization:

  1. “How do we know that the models have the correct parameters?” The advantage comes from the rigorous testing on thousands of systems for different market types that we use to select for systems whose performance continued strong and stable with time.

  2. "What's the outlook for its performance going forward?"

    Our temporal validation technique approximates real-world deployment, suggesting substantial evidence that the optimized models will exhibit good performance? Its performance characteristics.

  3. "Are these settings trustworthy?"

    Yes, all configurations are stress-tested for the peculiar challenges posed by financial data such as regime changes, noise , non-stationarity etc.

Practical Business Impact This is a strict approach that pays off:

  • Greater Predictive Power: Enhanced models better capture opportunities in the market.

  • Enhanced Risk Management: Properly configured hyperparameter optimization effectively mitigates overfitting and reduces the likelihood of strategy underperformance or failure.

  • Computational Efficiency: intelligent search techniques can find the best setup faster, which means a lower cost in resources used for development.

  • Performance Consistency: The reliability and stability of model performance are maintained even as market conditions fluctuate, ensuring that predictive accuracy does not degrade when exposed to evolving financial environments.

In [ ]:
import warnings
warnings.filterwarnings('ignore')
!pip install --upgrade pandas==2.2.2 tensorflow==2.19.0 scikeras scikit-optimize yfinance pandas_ta quantstats shap--quiet
In [2]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import yfinance as yf
import pandas_ta as ta  # technical analysis helpers

# Machine Learning / optimization
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.base import clone

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from scikeras.wrappers import KerasClassifier

from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

# -------------------------------------------------------------------------
# CONFIGURATION - OPTIMIZED FOR SPEED
# -------------------------------------------------------------------------
# Set to True for a quick test run with few iterations.
# Set to False for a full, thorough search (will take a long time).
FAST_MODE = True
# -------------------------------------------------------------------------

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
tf.random.set_seed(RANDOM_STATE)

# --- Define search parameters based on FAST_MODE ---
if FAST_MODE:
    print("--- RUNNING IN ULTRA-FAST MODE ---")
    # Skip expensive operations
    SKIP_GRID_SEARCH = True
    SKIP_RANDOM_SEARCH = True  
    SKIP_NN = True
    # SVM parameters
    svm_c_grid = [1]
    svm_gamma_grid = [0.01]
    svm_kernel_grid = ['rbf']
    svm_random_iters = 2
    svm_bayes_iters = 3
    # NN parameters
    nn_search_iters = 2
    nn_epochs = 3
    # LDA parameters
    lda_shrinkage_grid = [0.0, 0.5, 1.0]
    n_splits = 3
else:
    print("--- RUNNING IN FULL_MODE (this will take time) ---")
    # Run everything
    SKIP_GRID_SEARCH = False
    SKIP_RANDOM_SEARCH = False
    SKIP_NN = False
    # SVM parameters
    svm_c_grid = [0.1, 1, 10, 100]
    svm_gamma_grid = [0.001, 0.01, 0.1, 1]
    svm_kernel_grid = ['rbf', 'linear']
    svm_random_iters = 20  # Reduced from 50
    svm_bayes_iters = 20   # Reduced from 50
    # NN parameters
    nn_search_iters = 10   # Reduced from 20
    nn_epochs = 30         # Reduced from 50
    # LDA parameters
    lda_shrinkage_grid = np.arange(0.0, 1.01, 0.1)  # 11 values instead of 21
    n_splits = 5

# -------------------------------------------------------------------------
# 1) DATA ACQUISITION - OPTIMIZED
# -------------------------------------------------------------------------
print("Downloading data...")
ticker = 'SPY'
# Use smaller date range for faster testing
if FAST_MODE:
    data = yf.download(ticker, start='2020-01-01', end='2023-12-31', auto_adjust=False)
else:
    data = yf.download(ticker, start='2010-01-01', end='2023-12-31', auto_adjust=False)

# Flatten MultiIndex columns if yfinance returned them
if isinstance(data.columns, pd.MultiIndex):
    data.columns = data.columns.droplevel(1)

print("Raw data shape:", data.shape)

# Reset index so 'Date' becomes a column
data = data.reset_index()

# -------------------------------------------------------------------------
# 2) FEATURE ENGINEERING - Technical indicators
# -------------------------------------------------------------------------
print("Calculating technical indicators...")
# use a time-indexed DF for pandas_ta
data_indexed = data.set_index('Date')

# compute indicators - only essential ones
rsi = ta.rsi(data_indexed['Close'], length=14)
macd = ta.macd(data_indexed['Close'], fast=12, slow=26, signal=9)
bbands = ta.bbands(data_indexed['Close'], length=20)
obv = ta.obv(data_indexed['Close'], data_indexed['Volume'])

# helper to find columns by pattern(s)
def find_col(df, patterns):
    """Return the first column name in df whose string contains any pattern in patterns (case-insensitive)."""
    if df is None:
        return None
    for pat in patterns:
        for c in df.columns:
            if pat.lower() in str(c).lower():
                return c
    return None

# --- Add RSI safely ---
if isinstance(rsi, pd.Series):
    data['RSI_14'] = rsi.values
elif isinstance(rsi, pd.DataFrame) and len(rsi.columns) >= 1:
    data['RSI_14'] = rsi.iloc[:, 0].values

# --- MACD: robust column detection ---
macd_upper = find_col(macd, ['MACD', 'macd'])
macd_signal = find_col(macd, ['MACDs', 'signal'])
macd_hist = find_col(macd, ['MACDh', 'hist', 'histogram'])

# fallback: positional if names not found
if macd_upper is None or macd_signal is None or macd_hist is None:
    if isinstance(macd, pd.DataFrame) and macd.shape[1] >= 3:
        macd_upper, macd_signal, macd_hist = macd.columns[:3]

# assign MACD columns if available
if macd_upper is not None:
    data['MACD'] = macd[macd_upper].values
if macd_signal is not None:
    data['MACD_signal'] = macd[macd_signal].values
if macd_hist is not None:
    data['MACD_hist'] = macd[macd_hist].values

# --- Bollinger Bands: robust detection ---
bb_upper_col = find_col(bbands, ['BBU', 'upper', 'bb_u', 'bb_upper'])
bb_middle_col = find_col(bbands, ['BBM', 'middle', 'bb_m', 'bb_middle'])
bb_lower_col = find_col(bbands, ['BBL', 'lower', 'bb_l', 'bb_lower'])

# fallback to positional order if pattern search fails but there are >=3 cols
if bb_upper_col is None or bb_middle_col is None or bb_lower_col is None:
    if isinstance(bbands, pd.DataFrame) and bbands.shape[1] >= 3:
        bb_upper_col, bb_middle_col, bb_lower_col = bbands.columns[:3]

# assign BB columns
data['BB_upper'] = bbands[bb_upper_col].values
data['BB_middle'] = bbands[bb_middle_col].values
data['BB_lower'] = bbands[bb_lower_col].values

# --- OBV ---
if isinstance(obv, pd.Series):
    data['OBV'] = obv.values
elif isinstance(obv, pd.DataFrame) and obv.shape[1] >= 1:
    data['OBV'] = obv.iloc[:, 0].values

# --- Simplified engineered features ---
data['return_lag_1'] = data['Close'].pct_change(1)
data['return_lag_2'] = data['Close'].pct_change(2)
data['volatility_10d'] = data['Close'].pct_change().rolling(window=10).std()

# --- Target ---
data['target'] = (data['Close'].shift(-1) > data['Close']).astype(int)

# -------------------------------------------------------------------------
# 3) CLEANING and PREPARATION
# -------------------------------------------------------------------------
# Drop rows with NaNs created by indicators
data.dropna(inplace=True)
data.reset_index(drop=True, inplace=True)

# Drop non-features
features_to_drop = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
for col in features_to_drop:
    if col in data.columns:
        data.drop(columns=col, inplace=True)

# Align X and y
X = data.drop(columns=['target'])
y = data['target']

print("\nPrepared data shapes:")
print("X shape:", X.shape)
print("y shape:", y.shape)
print("Feature columns:", list(X.columns))

# In FAST_MODE, use smaller dataset for testing
if FAST_MODE and len(X) > 1500:
    X = X[-1000:]
    y = y[-1000:]
    print(f"Reduced dataset to {len(X)} samples for faster testing")

# -------------------------------------------------------------------------
# 4) TIME-SERIES SPLIT
# -------------------------------------------------------------------------
tscv = TimeSeriesSplit(n_splits=n_splits)

# Quick visualization only in full mode
if not FAST_MODE:
    fig, ax = plt.subplots(figsize=(12, 3))
    for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
        ax.fill_betweenx([i-0.4, i+0.4], train_idx[0], train_idx[-1], color='tab:blue', alpha=0.6, label='Train' if i == 0 else "")
        ax.fill_betweenx([i-0.4, i+0.4], test_idx[0], test_idx[-1], color='tab:orange', alpha=0.8, label='Test' if i == 0 else "")
    ax.set_title('TimeSeriesSplit Cross-Validation Folds')
    ax.set_xlabel('Index')
    ax.set_ylabel('Fold')
    ax.legend()
    plt.tight_layout()
    plt.show()

# -------------------------------------------------------------------------
# 5) SVM PIPELINE & SEARCHES - OPTIMIZED
# -------------------------------------------------------------------------
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(random_state=RANDOM_STATE))
])

# Only run Bayesian Search in FAST_MODE (most efficient)
if not SKIP_GRID_SEARCH:
    svm_param_grid = {
        'svm__C': svm_c_grid,
        'svm__gamma': svm_gamma_grid,
        'svm__kernel': svm_kernel_grid
    }

    print("Running Grid Search for SVM...")
    grid_search = GridSearchCV(svm_pipeline, svm_param_grid, cv=tscv, n_jobs=-1, verbose=1)
    grid_search.fit(X, y)
    print("Grid Search best score:", grid_search.best_score_)
    print("Grid Search best params:", grid_search.best_params_)
else:
    print("Skipping Grid Search for SVM")
    grid_search = None

if not SKIP_RANDOM_SEARCH:
    svm_param_dist = {
        'svm__C': np.logspace(-3, 3, 7),  # Reduced from 10
        'svm__gamma': np.logspace(-4, 1, 6),  # Reduced from 10
        'svm__kernel': ['rbf', 'linear']  # Removed 'poly'
    }
    print("Running Randomized Search for SVM...")
    random_search = RandomizedSearchCV(svm_pipeline, svm_param_dist, n_iter=svm_random_iters, 
                                     cv=tscv, n_jobs=-1, verbose=1, random_state=RANDOM_STATE)
    random_search.fit(X, y)
    print("Random Search best score:", random_search.best_score_)
    print("Random Search best params:", random_search.best_params_)
else:
    print("Skipping Randomized Search for SVM")
    random_search = None

# Always run Bayesian (most efficient)
svm_param_space = {
    'svm__C': Real(1e-3, 1e3, prior='log-uniform'),
    'svm__gamma': Real(1e-4, 1e1, prior='log-uniform'),
    'svm__kernel': Categorical(['rbf', 'linear'])  # Removed 'poly'
}

print("Running Bayesian Optimization for SVM...")
bayes_search = BayesSearchCV(svm_pipeline, search_spaces=svm_param_space, n_iter=svm_bayes_iters, 
                           cv=tscv, n_jobs=-1, verbose=1, random_state=RANDOM_STATE)
bayes_search.fit(X, y)
print("Bayes Search best score:", bayes_search.best_score_)
print("Bayes Search best params:", bayes_search.best_params_)

# -------------------------------------------------------------------------
# 6) NEURAL NETWORK - SKIPPED IN FAST_MODE
# -------------------------------------------------------------------------
if not SKIP_NN:
    def create_nn_model(learning_rate=0.001, dropout_rate=0.2, num_layers=2, neurons_per_layer=32, l2_reg=0.01):
        model = Sequential()
        model.add(Dense(neurons_per_layer, input_dim=X.shape[1], activation='relu',
                        kernel_regularizer=tf.keras.regularizers.l2(l2_reg)))
        model.add(Dropout(dropout_rate))
        for _ in range(num_layers - 1):
            model.add(Dense(max(4, neurons_per_layer // 2), activation='relu',
                            kernel_regularizer=tf.keras.regularizers.l2(l2_reg)))
            model.add(Dropout(dropout_rate))
        model.add(Dense(1, activation='sigmoid'))
        optimizer = Adam(learning_rate=learning_rate)
        model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
        return model

    nn_classifier = KerasClassifier(model=create_nn_model, epochs=nn_epochs, verbose=0, random_state=RANDOM_STATE)
    nn_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', nn_classifier)
    ])

    nn_param_space = {
        'model__learning_rate': Real(1e-5, 1e-1, prior='log-uniform'),
        'model__dropout_rate': Real(0.1, 0.5, prior='uniform'),
        'model__num_layers': Integer(1, 3),
        'model__neurons_per_layer': Integer(16, 64),  # Reduced max from 128
        'model__l2_reg': Real(1e-4, 1e-1, prior='log-uniform'),
        'model__batch_size': Categorical([32, 64]),  # Reduced options
    }

    print("Running Bayesian Optimization for Neural Network...")
    nn_bayes_search = BayesSearchCV(
        nn_pipeline,
        search_spaces=nn_param_space,
        n_iter=nn_search_iters,
        cv=tscv,
        n_jobs=1,  # Keras doesn't play well with n_jobs > 1
        verbose=1,
        random_state=RANDOM_STATE,
        scoring='accuracy'
    )
    nn_bayes_search.fit(X, y)
    print("NN Bayes best score:", nn_bayes_search.best_score_)
    print("NN Bayes best params:", nn_bayes_search.best_params_)
else:
    print("Skipping Neural Network for speed")
    nn_bayes_search = None

# -------------------------------------------------------------------------
# 7) LDA - OPTIMIZED
# -------------------------------------------------------------------------
lda_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis())
])

lda_param_grid = {
    'lda__solver': ['lsqr', 'eigen'],
    'lda__shrinkage': lda_shrinkage_grid
}

print("Running Grid Search for LDA...")
lda_grid_search = GridSearchCV(lda_pipeline, lda_param_grid, cv=tscv, n_jobs=-1, verbose=1)
lda_grid_search.fit(X, y)
print("LDA best score:", lda_grid_search.best_score_)
print("LDA best params:", lda_grid_search.best_params_)

# -------------------------------------------------------------------------
# 8) FINAL TRAIN / TEST and EVALUATION
# -------------------------------------------------------------------------
train_size = int(len(X) * 0.8)
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

# Use best SVM from Bayesian search (most efficient method)
best_svm_params = dict(bayes_search.best_params_)
final_svm_params = {k.replace('svm__', ''): v for k, v in best_svm_params.items()}

final_model = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(**final_svm_params, random_state=RANDOM_STATE))
])

print("Training final SVM model with optimized hyperparameters...")
final_model.fit(X_train, y_train)
y_pred = final_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Final model test accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Down/Flat', 'Up'],
            yticklabels=['Down/Flat', 'Up'])
plt.title('Figure 1: Confusion Matrix (Final SVM)')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

# -------------------------------------------------------------------------
# 9) Comparison of CV results
# -------------------------------------------------------------------------
models_comparison = {
    'SVM (Bayesian)': bayes_search.best_score_,
    'LDA': lda_grid_search.best_score_,
}

# Only add if they were run
if grid_search is not None:
    models_comparison['SVM (Grid Search)'] = grid_search.best_score_
if random_search is not None:
    models_comparison['SVM (Random Search)'] = random_search.best_score_
if nn_bayes_search is not None:
    models_comparison['Neural Network'] = nn_bayes_search.best_score_

comparison_df = pd.DataFrame(list(models_comparison.items()), columns=['Model', 'CV Score']).sort_values('CV Score', ascending=False)
print("\nModels comparison (CV scores):")
print(comparison_df)

plt.figure(figsize=(10, 5))
bars = plt.bar(comparison_df['Model'], comparison_df['CV Score'])
plt.title('Figure 2: Comparison of Model CV Scores')
plt.ylabel('CV Accuracy')
plt.xticks(rotation=45)
for bar in bars:
    plt.text(bar.get_x() + bar.get_width()/2., bar.get_height(), f"{bar.get_height():.4f}", ha='center', va='bottom')
plt.tight_layout()
plt.show()

# -------------------------------------------------------------------------
# 10) FEATURE IMPORTANCE via LDA coefficients
# -------------------------------------------------------------------------
best_lda = lda_grid_search.best_estimator_.named_steps['lda']
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': np.abs(best_lda.coef_[0]) if hasattr(best_lda, 'coef_') else np.zeros(len(X.columns))
})
feature_importance = feature_importance.sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'][:10], feature_importance['importance'][:10])
plt.gca().invert_yaxis()
plt.title('Figure 3: Top 10 features (LDA absolute coeff)')
plt.xlabel('Absolute Coefficient')
plt.tight_layout()
plt.show()

print("\nTop 10 most important features (LDA):")
print(feature_importance.head(10))
2025-10-20 22:34:15.440155: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-10-20 22:34:15.479622: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-10-20 22:34:16.368580: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
--- RUNNING IN ULTRA-FAST MODE ---
Downloading data...
[*********************100%***********************]  1 of 1 completed
Raw data shape: (1006, 6)
Calculating technical indicators...

Prepared data shapes:
X shape: (973, 11)
y shape: (973,)
Feature columns: ['RSI_14', 'MACD', 'MACD_signal', 'MACD_hist', 'BB_upper', 'BB_middle', 'BB_lower', 'OBV', 'return_lag_1', 'return_lag_2', 'volatility_10d']
Skipping Grid Search for SVM
Skipping Randomized Search for SVM
Running Bayesian Optimization for SVM...
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Bayes Search best score: 0.5185185185185185
Bayes Search best params: OrderedDict({'svm__C': 105.76211650904162, 'svm__gamma': 2.6096146808538574, 'svm__kernel': 'rbf'})
Skipping Neural Network for speed
Running Grid Search for LDA...
Fitting 3 folds for each of 6 candidates, totalling 18 fits
LDA best score: 0.4732510288065843
LDA best params: {'lda__shrinkage': 1.0, 'lda__solver': 'lsqr'}
Training final SVM model with optimized hyperparameters...
Final model test accuracy: 0.5436
              precision    recall  f1-score   support

           0       0.44      0.21      0.29        84
           1       0.57      0.79      0.66       111

    accuracy                           0.54       195
   macro avg       0.51      0.50      0.48       195
weighted avg       0.51      0.54      0.50       195

No description has been provided for this image
Models comparison (CV scores):
            Model  CV Score
0  SVM (Bayesian)  0.518519
1             LDA  0.473251
No description has been provided for this image
No description has been provided for this image
Top 10 most important features (LDA):
           feature  importance
4         BB_upper    0.096574
8     return_lag_1    0.078948
5        BB_middle    0.075201
10  volatility_10d    0.065124
2      MACD_signal    0.060802
0           RSI_14    0.059915
6         BB_lower    0.054014
1             MACD    0.053771
3        MACD_hist    0.008221
9     return_lag_2    0.007151

Interpretation

To get the best performance, we spent time tuning our models using Bayesian Search for the SVM and Grid Search for the LDA. We fed them SPY stock data that included technical indicators like RSI, MACD, and Bollinger Bands, and it's clear this tuning really mattered. If you look at Figure 2 (Comparison of Model CV Scores), you can see how this played out. The Bayesian-tuned SVM got a better cross-validation score (0.5185) than the LDA model (0.4733), so that's the one we picked. We then retrained this optimized SVM and tested it on a final, unseen dataset. The confusion matrix in Figure 1 (Final SVM) shows the results. The final test accuracy was 54.4% (106 correct out of 195). However, the model had a strong predictive bias: while it correctly spotted 88 "Up" days, it also wrongly labeled 66 "Down/Flat" days as "Up." Finally, Figure 3 (Top 10 features) gives us a peek into what data the models actually cared about. It turns out that price boundaries (like the upper and middle Bollinger Bands) and recent momentum (return_lag_1) were the most critical factors.

Optimizing the Bias-Variance Tradeoff¶

Technical Section¶

The bias-variance tradeoff provides a fundamental structural analysis of model generalization in quantitative finance, elucidating the decomposition of expected prediction error into its constituent components. This decomposition offers a systematic approach to model selection and regularization that aligns with the realities of modeling financial time series (Jema and Ali 1).

In the financial sector, this framework is especially crucial because it. Dale differentiates between (1) systematic model limitations (bias) that result in persistent mispricing; (2) estimation instability (variance) that causes unreliable performance across various market regimes; and (3) two types of uncertainty that cannot be eradicated through modeling (Timmermann 454).

The formalization of this tradeoff gives quants a way to make principled choices about how complex their models need to be in order to strike the right balance between accurately capturing sophisticated market dynamics and being robust to regime changes and data limitations. This is especially important in finance because (1) a simple model might not find profitable arbitrage opportunities because the discount rate is too simple, (2) a complex model that does well in backtesting might not make a lot of money when trading live, and (3) an estimation error has a cost that can be directly linked to a loss of funds (Dar, Vidya, and Richard 2).

Mathematical Decomposition of Mean Squared Error (MSE)¶

In the field of quantitative finance, we look at how financial factors $x$ (like valuation ratios, momentum signals, or economic indicators) relate to a target variable $y$ (such as asset returns, volatility, or the chance of default). We express this relationship as $y = f(x) + \epsilon$. Here, $\epsilon$ stands for the unique risk and market noise, which has a mean of zero and a variance of $\sigma^2$. We use past market data to create a predictive model $\hat{f}(x)$ to forecast future financial results. We can break down the expected prediction error for a new unseen observation like this (Hastie et al. 223):

$$ E[(y - \hat{f}(x))^2] = \underbrace{(E[\hat{f}(x)] - f(x))^2}_{\text{Pricing Bias}^2} + \underbrace{E[(\hat{f}(x) - E[\hat{f}(x)])^2]}_{\text{Model Instability}} + \underbrace{\sigma^2}_{\text{Systematic Market Risk}} $$

This key relationship in financial econometrics has an expression as:

$$ \text{Forecasting Error} = \text{Pricing Bias}^2 + \text{Model Instability} + \text{Systematic Market Risk} $$

The three parts stand for different causes of forecasting inaccuracy in financial modeling (Jema and Ali 7):

  1. Pricing Bias Squared: $(E[\hat{f}(x)] - f(x))^2$

    • Shows constant mispricing because of wrong model specs
    • Models with high bias (like CAPM for complex alternative assets) misprice securities across market conditions
    • Shows the difference between the model's average value and the real economic worth
  2. Model Instability: $E[(\hat{f}(x) - E[\hat{f}(x)])^2]$

    • Measures how sensitive the model is to specific time periods and data samples
    • Models with high variance (like complex neural networks) create very different asset allocations when you train them on different market cycles
    • Shows how much the model tends to fit too to temporary market oddities and random fluctuations
  3. Systematic Market Risk: $\sigma^2$

    • Stands for the part of asset returns you can't predict or control
    • You can't get rid of it even with better modeling, as it reflects basic economic uncertainty
    • Matches the smallest forecasting error you can achieve even if your model is perfect

The usual view of financial modeling shows a U-shaped curve (cliff). This suggests the best model sits at the "efficient frontier" balancing pricing efficiency and stability across different market conditions.

Example: Consider predicting stock returns using different modeling approaches:

  • High Bias: A simple dividend discount model applied to growth stocks often undervalues them.
  • High Variance: A complex deep learning model fits past data well but struggles with new economic conditions.
  • Best Balance: A multi-factor method aims to capture historical risk premiums driving returns. It also gives investors more stability than they'd achieve alone across various economic scenarios.

This breakdown forms the math foundation that explains why certain strategies succeed in some markets and fail in others. It also adds models of varying complexity for different asset types or investment timeframes.

The tradeoff with Regularization¶

Regularization methods play a key role in managing risk by keeping model complexity in check in quantitative finance. Adding penalty terms to the loss function helps us be more disciplined when estimating parameters. This tackles head-on the basic problem of telling real market signals apart from noise. Looking at it from a Bayesian econometric angle, regularization is like Maximum A Posteriori (MAP) estimation. It uses informative priors that capture financial know-how:

L2 Regularization (Ridge Regression) for Factor Models:

  • Financial loss function: $$\mathcal{L}_{L2} = \sum_{t=1}^T (r_t - \hat{r}_t)^2 + \lambda \sum_{j=1}^p \beta_j^2$$
  • Economic interpretation: Gaussian prior $\beta_j \sim \mathcal{N}(0, \tau^2)$ reflects the belief that most factor exposures should be modest
  • Portfolio effect: Shrinks factor weights toward zero while preserving all signals
  • Optimal for: Multi-factor models with correlated risk premia where we believe most factors contribute marginally (Hastie et al. 63).

L1 Regularization (Lasso Regression) for Signal Selection:

  • Financial loss function: $$\mathcal{L}_{L1} = \sum_{t=1}^T (r_t - \hat{r}_t)^2 + \lambda \sum_{j=1}^p |\beta_j|$$
  • Economic interpretation: Laplace prior $\beta_j \sim \text{Laplace}(0, b)$ assumes sparse true factor structure
  • Portfolio effect: Performs automatic signal selection by zeroing out noisy factors
  • Optimal for: High-dimensional alpha research with hundreds of potential signals where true drivers are sparse (Hastie et al. 69).

The Double Descent Phenomenon¶

Modern deep learning challenges conventional financial modeling wisdom through the "double descent" phenomenon (Belkin et al. 2), revealing profound implications for quantitative strategy development:

Classical Regime (Traditional Factor Modeling):

  • Performance follows U-shape: $\text{Tracking Error}(p) \approx \text{Specification Error}(p) + \text{Estimation Error}(p)$
  • Optimal complexity at $p = T$ where number of parameters matches historical observations;

Modern Regime (Deep Learning Approaches):

  • Performance improves again: $\text{Tracking Error}(p) \approx \frac{C}{p}$ for $p > T$
  • Over-parameterized models achieve superior out-of-sample performance despite perfect in-sample fit.

The mechanism involves implicit regularization through optimization. In over-parameterized financial models, stochastic gradient descent selects market-neutral solutions:

$$ \hat{\beta} = \arg\min_\beta \|\beta\|_2 \quad \text{subject to} \quad X\beta = r $$

This explains why massively parameterized neural networks can achieve better risk-adjusted returns, suggesting a paradigm shift from traditional parsimonious factor models toward capacity-driven approaches in certain quantitative domains (Jema and Ali 7, Belkin et al. 5).

Non-Technical Section¶

The most difficult part is coming up with models in quantitative finance that actually represent real opportunities, rather than simply random patterns that implode leaving huge drawdowns. This balance between model complexity and resilience must be handled with the utmost care.

Two Types of Strategy Failure

Underfitting (High Bias) - The Conservative Way:

  • Happens when models are too simplistic to reflect complex market dynamics.

For example Using only traditional value factors in technology driven markets.

  • Outcome: Continually fails to spot new trends and structural changes.
  • Portfolio Impact: Stable but sub-optimal returns; failed to capture alpha in changing markets.

Overfitting (High Variance) - The Curve-Fitting Trap: This happens when models are too complex, and they cannot distinguish between historical coincidences and true underlying patterns.

Example: A strategy explaining all past returns that totally collapses when the regime changes.

  • Outcome: Amazing backtest, real life loss.
  • Portfolio Impact: High daylight and concentration errors, as well as unexpected drawdowns when historical relationships stop working.

Finding the "Just Right" Strategy¶

When building a trading strategy, we're all stuck in a classic tug-of-war: we need a model that's smart enough to find real patterns in the market, but not so complicated that it breaks the second the market changes its mind.

It’s a balancing act. The strategy needs to be sophisticated enough to actually work, but also tough enough to survive in the wild (and not just in our backtest).

Regularization: The Built-in Bouncer¶

This is where "regularization" comes in, and it's less scary than it sounds. Think of it as a bouncer or a risk manager for your model. Its job is to keep the strategy from getting too crazy.

It does this by:

  • Forcing diversification: It stops the model from falling in love with just one signal. Instead, it encourages it to listen to several different, unrelated ideas.
  • Keeping it honest: It nudges the model to focus on relationships that actually make economic sense, not just random quirks in the data.
  • Preventing "one-hit wonders": It reduces the risk that your strategy only worked because of one specific time period (like the 2008 crash or the 2021 meme stock rally).

Basically, just like you diversify your portfolio with different assets, regularization diversifies your strategy with different signals.

When "Too Complicated" Becomes Good¶

Now, here’s the plot twist. We’ve all been told that "simple is better." But new research is showing that's not always true.

It turns out that if you make a model way, way more complex—past the point where you think it would just memorize everything—something interesting happens. This is called "double descent."

Instead of just memorizing the past (overfitting), these massive models can become so complex that they're forced to find a deeper, more general understanding of the data. They build a richer "picture" of the world.

In certain areas, like high-frequency trading or digging through weird "alternative data," these giant, complex models can actually beat the pants off the simpler, classic approaches. It flips the old wisdom on its head and suggests that for the messy, complex world of finance, bigger might sometimes be better.

Practical Applications to Investment Management¶

Being aware of this tradeoff is an essential consideration in portfolio construction and risk management:

  1. Choice of Strategy: Matching model complexity with investment horizon and characteristics of the asset class
  2. Risk Budget: How much risk you should take across well-established simple and more advanced strategies.
  3. Performance Attribution: True alpha vs overfitting backtest performances
  4. Research Direction: Allocating resources for refining classical techniques as well as investigating modern approaches

By understanding the bias-variance, we reach superior quantitative strategies that produce stable alpha while reducing the risk of breakdown in market stress regime.

In [3]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import yfinance as yf
import pandas_ta as ta  # technical analysis helpers

# Machine Learning / optimization
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

# -------------------------------------------------------------------------
# CONFIGURATION
# -------------------------------------------------------------------------
# Set to True for a quick test run with few iterations.
# Set to False for a full, thorough search (will take a long time).
FAST_MODE = True
# -------------------------------------------------------------------------

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
tf.random.set_seed(RANDOM_STATE)

# --- Define search parameters based on FAST_MODE ---
if FAST_MODE:
    print("--- RUNNING IN FAST_MODE ---")
    svr_gamma_points = 15
    reg_alpha_points = 30
    nn_width_range = [1, 2, 4, 6, 8, 10, 12, 14, 16, 24, 32, 64]
    nn_epochs = 100
else:
    print("--- RUNNING IN FULL_MODE (this will take time) ---")
    svr_gamma_points = 30
    reg_alpha_points = 100
    nn_width_range = [1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 20, 24, 32, 48, 64, 96, 128, 192, 256]
    nn_epochs = 500


# -------------------------------------------------------------------------
# 1) DATA ACQUISITION & FEATURE ENGINEERING
# -------------------------------------------------------------------------
ticker = 'SPY'
data = yf.download(ticker, start='2010-01-01', end='2023-12-31', auto_adjust=False)
if isinstance(data.columns, pd.MultiIndex):
    data.columns = data.columns.droplevel(1)
data = data.reset_index()

# --- Features (same as before) ---
data_indexed = data.set_index('Date')
data['RSI_14'] = ta.rsi(data_indexed['Close'], length=14).values
macd = ta.macd(data_indexed['Close'], fast=12, slow=26, signal=9)
data['MACD'] = macd.iloc[:, 0].values
data['MACD_hist'] = macd.iloc[:, 1].values
bbands = ta.bbands(data_indexed['Close'], length=20)
data['BB_upper'] = bbands.iloc[:, 2].values
data['BB_middle'] = bbands.iloc[:, 1].values
data['BB_lower'] = bbands.iloc[:, 0].values
data['OBV'] = ta.obv(data_indexed['Close'], data_indexed['Volume']).values
data['ATR_14'] = ta.atr(data_indexed['High'], data_indexed['Low'], data_indexed['Close'], length=14).values
for lag in range(1, 6):
    data[f'return_lag_{lag}'] = data['Close'].pct_change(lag)
data['volatility_20d'] = data['Close'].pct_change().rolling(window=20).std()

# --- TARGET (MODIFIED FOR REGRESSION) ---
# We predict the next day's percentage return, matching the MSE math.
# Multiply by 100 to make MSE values more readable.
data['target'] = data['Close'].pct_change(1).shift(-1) * 100

print("Data preparation...")
data.dropna(inplace=True)
data.reset_index(drop=True, inplace=True)

# --- Define X and y ---
features_to_drop = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'target']
X = data.drop(columns=[col for col in features_to_drop if col in data.columns])
y = data['target']

# --- Train/Test Split ---
train_size = int(len(X) * 0.8)
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]

# --- Scale Data (CRUCIAL) ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
n_features = X.shape[1]

print(f"Data ready. n_features: {n_features}, n_train: {len(y_train)}, n_test: {len(y_test)}")

# -------------------------------------------------------------------------
# 2) SECTION 1: THE CLASSIC BIAS-VARIANCE U-CURVE
# -------------------------------------------------------------------------
print("\nRunning Section 1: Classic Bias-Variance U-Curve (SVR)...")

# We vary 'gamma' to control complexity.
# Low gamma = simple model (high bias)
# High gamma = complex model (high variance)
gamma_range = np.logspace(-4, 1, svr_gamma_points)
train_errors = []
test_errors = []

for gamma in gamma_range:
    model = SVR(kernel='rbf', gamma=gamma, C=1.0) # C=1 is a reasonable default
    model.fit(X_train_scaled, y_train)

    y_pred_train = model.predict(X_train_scaled)
    y_pred_test = model.predict(X_test_scaled)

    train_errors.append(mean_squared_error(y_train, y_pred_train))
    test_errors.append(mean_squared_error(y_test, y_pred_test))

plt.figure(figsize=(10, 6))
plt.plot(gamma_range, train_errors, 'b-o', label='Train Error (Bias Proxy)')
plt.plot(gamma_range, test_errors, 'r-o', label='Test Error (Total Error)')
plt.xscale('log')
plt.xlabel('Model Complexity (SVR Gamma)')
plt.ylabel('Mean Squared Error (MSE)')
plt.title('Figure 4: The Classic Bias-Variance Tradeoff')
plt.legend()
plt.ylim(0, np.mean(y_test**2) * 1.5) # Cap y-axis at 1.5x variance of y
plt.fill_between(gamma_range, train_errors, test_errors, color='orange', alpha=0.2, label='Variance Proxy (Train-Test Gap)')
plt.axvline(x=gamma_range[np.argmin(test_errors)], color='grey', linestyle='--', label=f'Optimal Gamma')
plt.legend()
plt.show()

# -------------------------------------------------------------------------
# 3) SECTION 2: REGULARIZATION (L1 & L2)
# -------------------------------------------------------------------------
print("\nRunning Section 2: Regularization Paths (Lasso/Ridge)...")

alpha_range = np.logspace(-5, 2, reg_alpha_points)

lasso_coefs = []
lasso_test_mse = []
ridge_coefs = []
ridge_test_mse = []

for alpha in alpha_range:
    # Lasso (L1)
    lasso = Lasso(alpha=alpha, random_state=RANDOM_STATE, max_iter=1000)
    lasso.fit(X_train_scaled, y_train)
    lasso_coefs.append(lasso.coef_)
    lasso_test_mse.append(mean_squared_error(y_test, lasso.predict(X_test_scaled)))

    # Ridge (L2)
    ridge = Ridge(alpha=alpha, random_state=RANDOM_STATE)
    ridge.fit(X_train_scaled, y_train)
    ridge_coefs.append(ridge.coef_)
    ridge_test_mse.append(mean_squared_error(y_test, ridge.predict(X_test_scaled)))

# --- Plot Regularization Results ---
fig, axs = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Figure 5: Regularization Impact on Bias-Variance Tradeoff', fontsize=16)

# Plot 1: Lasso (L1) Coefficient Path
axs[0, 0].plot(alpha_range, lasso_coefs)
axs[0, 0].set_xscale('log')
axs[0, 0].set_title('A) L1 (Lasso) Coefficient Path')
axs[0, 0].set_xlabel('Penalty (alpha)')
axs[0, 0].set_ylabel('Coefficient Weight')

# Plot 2: Lasso (L1) MSE vs. Alpha
axs[0, 1].plot(alpha_range, lasso_test_mse, 'r-o')
axs[0, 1].set_xscale('log')
axs[0, 1].set_title('B) L1 (Lasso) Test Error')
axs[0, 1].set_xlabel('Penalty (alpha)')
axs[0, 1].set_ylabel('Test MSE')
axs[0, 1].axvline(x=alpha_range[np.argmin(lasso_test_mse)], color='grey', linestyle='--', label='Optimal Alpha')
axs[0, 1].legend()
axs[0, 1].set_ylim(min(lasso_test_mse) * 0.95, min(lasso_test_mse) * 1.5) # Zoom in

# Plot 3: Ridge (L2) Coefficient Path
axs[1, 0].plot(alpha_range, ridge_coefs)
axs[1, 0].set_xscale('log')
axs[1, 0].set_title('C) L2 (Ridge) Coefficient Path')
axs[1, 0].set_xlabel('Penalty (alpha)')
axs[1, 0].set_ylabel('Coefficient Weight')

# Plot 4: Ridge (L2) MSE vs. Alpha
axs[1, 1].plot(alpha_range, ridge_test_mse, 'r-o')
axs[1, 1].set_xscale('log')
axs[1, 1].set_title('D) L2 (Ridge) Test Error')
axs[1, 1].set_xlabel('Penalty (alpha)')
axs[1, 1].set_ylabel('Test MSE')
axs[1, 1].axvline(x=alpha_range[np.argmin(ridge_test_mse)], color='grey', linestyle='--', label='Optimal Alpha')
axs[1, 1].legend()
axs[1, 1].set_ylim(min(ridge_test_mse) * 0.95, min(ridge_test_mse) * 1.5) # Zoom in

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

# -------------------------------------------------------------------------
# 4) SECTION 3: THE DOUBLE DESCENT PHENOMENON
# -------------------------------------------------------------------------
print("\nRunning Section 3: Double Descent (Neural Network)...")
print(f"(Training {len(nn_width_range)} models for {nn_epochs} epochs each...)")

# We need a small dataset to find the "interpolation threshold" (p ≈ n)
n_samples = 200
X_train_sub = X_train_scaled[:n_samples]
y_train_sub = y_train.iloc[:n_samples]

# Estimate interpolation threshold: p ≈ n
# For a 1-layer NN, p ≈ (n_features * width) + width + (width * 1) + 1
# p ≈ (n_features + 2) * width
# p ≈ n_samples  =>  (17 + 2) * width ≈ 200 => 19 * width ≈ 200
interpolation_threshold_width = n_samples / (n_features + 2)
print(f"Interpolation Threshold (p≈n) estimated at width ≈ {interpolation_threshold_width:.1f}")

nn_train_errors = []
nn_test_errors = []

for width in nn_width_range:
    # Define a simple 1-hidden-layer NN
    # NO regularization (no dropout, no L1/L2) is key to seeing the effect
    model = Sequential([
        Dense(width, activation='relu', input_dim=n_features),
        Dense(1)  # Linear output for regression
    ])

    model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')

    # Train to interpolation (or close to it)
    model.fit(X_train_sub, y_train_sub, epochs=nn_epochs, verbose=0, batch_size=32)

    # Evaluate
    train_mse = model.evaluate(X_train_sub, y_train_sub, verbose=0)
    test_mse = model.evaluate(X_test_scaled, y_test, verbose=0)

    nn_train_errors.append(train_mse)
    nn_test_errors.append(test_mse)

    if width % 8 == 0 or width == 1:
        print(f"  Width: {width:3d} | Train MSE: {train_mse:7.4f} | Test MSE: {test_mse:7.4f}")


plt.figure(figsize=(12, 7))
plt.plot(nn_width_range, nn_train_errors, 'b-o', label='Train Error')
plt.plot(nn_width_range, nn_test_errors, 'r-o', label='Test Error')
plt.xscale('log')
plt.yscale('log')
plt.title('Figure 6: The Double Descent Phenomenon')
plt.xlabel('Model Capacity (Width of Hidden Layer)')
plt.ylabel('Mean Squared Error (Log Scale)')

# Highlight the two regimes
plt.axvline(x=interpolation_threshold_width, color='grey', linestyle='--',
            label=f'Interpolation Threshold (p≈n)\nWidth ≈ {interpolation_threshold_width:.1f}')
plt.axvspan(min(nn_width_range), interpolation_threshold_width, alpha=0.1, color='blue', label='Classical Regime')
plt.axvspan(interpolation_threshold_width, max(nn_width_range), alpha=0.1, color='green', label='Modern Regime')
plt.legend()
plt.show()
--- RUNNING IN FAST_MODE ---
[*********************100%***********************]  1 of 1 completed
Data preparation...
Data ready. n_features: 14, n_train: 2790, n_test: 698

Running Section 1: Classic Bias-Variance U-Curve (SVR)...
No description has been provided for this image
Running Section 2: Regularization Paths (Lasso/Ridge)...
No description has been provided for this image
Running Section 3: Double Descent (Neural Network)...
(Training 12 models for 100 epochs each...)
Interpolation Threshold (p≈n) estimated at width ≈ 12.5
2025-10-20 22:34:31.243612: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
  Width:   1 | Train MSE:  1.3753 | Test MSE:  1.2711
  Width:   8 | Train MSE:  1.2898 | Test MSE:  2.5647
  Width:  16 | Train MSE:  1.2056 | Test MSE:  4.3503
  Width:  24 | Train MSE:  1.2509 | Test MSE:  1.8374
  Width:  32 | Train MSE:  1.1483 | Test MSE:  7.0487
  Width:  64 | Train MSE:  1.0316 | Test MSE:  5.3992
No description has been provided for this image

Interpretation

When we analyzed the model complexity on the SPY stock data, we were really trying to understand the bias-variance tradeoff. Our results point to three key concepts.

First, Figure 4 shows the classic "U-shaped" error curve. We used a Support Vector Regressor (SVR) to demonstrate this. You can see the test error (red line) initially drops as the model gets more complex (by increasing the 'Gamma' parameter), which means we're reducing bias. But after hitting a low point, the error starts to climb again. That's the model beginning to overfit and increase its variance. The best model is right at the bottom of that curve.

Next, Figure 5 looks at how L1 (Lasso) and L2 (Ridge) regularization help manage this tradeoff. Plots A and C show L1's aggressive approach, forcing some coefficients to exactly zero (which is a form of feature selection), while L2 just shrinks them. Plots B and D confirm that both methods have an "Optimal Alpha," or penalty strength, that finds the best possible test error and prevents overfitting.

Finally, Figure 6 uses a neural network to show something that challenges the classic U-curve: the "Double Descent" phenomenon. Here, the test error peaks (in the "Classical Regime"), right around the point where the number of parameters matches the number of data points. But then, as we make the model even bigger, the error surprisingly starts to go down again (in the "Modern Regime"). This suggests that making a model extremely large, or "over-parameterized," can paradoxically lead to better results.

Applying Ensemble Learning: Bagging, Boosting, or Stacking¶

Technical Section¶

Ensemble learning is a machine learning paradigm that strategically combines multiple base models to produce a single, superior predictive model. The foundational principle, often termed the "wisdom of crowds," posits that a collective decision from a diverse set of models is more robust and accurate than that of any single constituent model (Hastie et al. 605). The efficacy of an ensemble is causally dependent on two conditions: the individual accuracy of the base learners and, more critically, their diversity meaning they make uncorrelated errors on different data instances (Hastie et al. 608). If all base models are highly correlated, the ensemble offers marginal benefit. This principle guides model selection: we combine fundamentally different algorithms like SVMs (geometric, margin-based), Neural Networks (connectionist, non-linear), and linear models (probabilistic, assumption-driven) to ensure diverse "opinions" that a meta-learner can optimally synthesize.

Homogeneous Ensembles: Bagging and Boosting¶

These methods use multiple instances of a single, typically weak, base learning algorithm.

1. Bagging (Bootstrap Aggregating)

Bagging is a parallel ensemble method designed primarily to reduce variance in high-variance, low-bias models like deep decision trees (Breiman 1). The algorithm operates as follows:

  1. Bootstrap Sampling: Given a training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, Bagging generates $ m $ new training sets $D_i $, each of size $n $, by sampling from $D $ uniformly and with replacement. This means each $D_i $ is a random variation of the original data, with some samples repeated and others omitted (the "out-of-bag" samples).

  2. Parallel Training: A base model $h_i $ is trained independently on each bootstrap sample $D_i $.

  3. Aggregation: For a regression task, the final prediction $\hat{y} $ is the average of all individual predictions: $$ \hat{y} = \frac{1}{m} \sum_{i=1}^{m} h_i(x) $$ For classification, it is the mode (majority vote): $$ \hat{y} = \underset{c}{\text{mode}} \{ h_1(x), h_2(x), \ldots, h_m(x) \} $$ The Random Forest algorithm is a quintessential application of Bagging to decision trees, which introduces further diversity by randomly selecting a subset of features at each split, thereby de-correlating the trees even more (Breiman 5).

2. Boosting

Boosting is a sequential technique that converts a set of weak learners into a strong learner by iteratively focusing on misclassified examples, thereby reducing bias (Freund and Schapire 3). The most famous algorithm, AdaBoost (Adaptive Boosting), works as follows:

  1. Initialize Weights: Assign equal weight $w_i^{(1)} = \frac{1}{n} $ to each training instance.
  2. For each iteration $t = 1 $ to $T $:

a) Train a weak learner $h_t $ on the training data weighted by $w_i^{(t)} $.

b) Calculate the weighted error of $h_t $: $\epsilon_t = \sum_{i=1}^{n} w_i^{(t)} \cdot \mathbf{1}(y_i \neq h_t(x_i)) $.

  1. Final Model: The final prediction is a weighted majority vote: $$ H(x) = \text{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right) $$

c) Compute the learner's weight (or "vote"): $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) $. This equation shows that a model with a lower error $\epsilon_t $ is given a higher weight $\alpha_t $ in the final ensemble. d. Update the instance weights: Increase the weight of misclassified instances so the next learner focuses on them: $$ w_i^{(t+1)} = \frac{w_i^{(t)} \cdot \exp\left(-\alpha_t y_i h_t(x_i)\right)}{Z_t} $$ where $Z_t $ is a normalization factor to ensure the weights sum to one.

Heterogeneous Ensembles: Stacking (Stacked Generalization)¶

Stacking is a hierarchical method designed to combine multiple, diverse (heterogeneous) base models. It has consistently been shown to outperform individual models in complex financial forecasting tasks, including stock prediction and credit scoring (Fischer and Krauss 3).

The architecture of a stacking ensemble consists of two levels:

  • Level-0 (Base Learners): A set of $L $ diverse models $h_1, h_2, \ldots, h_L $ (e.g., SVM, NN, LDA) are trained on the original training data.
  • Level-1 (Meta-Learner): A meta-learner $g $ is trained to make the final prediction. Its input features are not the original data, but the predictions of the base learners.

To prevent data leakage and create a robust training set for the meta-learner, a $K $-fold cross-validation procedure is employed:

  1. The training data $D $ is split into $K $ folds, $D_1, D_2, \ldots, D_K $.
  2. For each base learner $h_l $:
  • For each fold $k $, train $h_l $ on all folds except $D_k $, then use it to predict on $D_k $. These are the "out-of-fold" predictions.
  • The collection of all out-of-fold predictions from $h_l $ forms a new feature vector $z_l $.
  1. The meta-feature matrix $Z = [z_1, z_2, \ldots, z_L] $ is constructed, with the original targets $y $. The meta-learner $g $ is trained on $(Z, y) $.
  2. The final base models, trained on the full training set, and the trained meta-learner form the stacking ensemble.

The power of stacking lies in its ability to learn an optimal combination strategy. Unlike a static weighted average, the meta-learner can be a non-linear function (e.g., a Gradient Boosting Machine) that learns a dynamic weighting scheme. For instance, it can learn to trust the Neural Network during high-volatility regimes while deferring to the SVM in trending markets, creating a highly adaptive and powerful predictor.

Proposed Stacking Framework for Financial Prediction:

  • Base Learners (Level-0): The optimized SVM, Neural Network, and a regularized linear model (e.g., Logistic Regression) from previous work.
  • Meta-Learner (Level-1): A simple, well-regularized model like Logistic Regression is a robust default to prevent overfitting. For capturing complex interactions between base models, a regularized Gradient Boosting Machine (e.g., XGBoost) can be highly effective, with the optimal choice determined via cross-validation.

Comparison of Ensemble Methods:

Method Core Principle Primary Goal Base Learner Type Combination Method Key Advantage in Finance
Bagging Trains many models at the same time, each on a slightly different random sample (bootstrap) of the data. To reduce variance and make the model more stable. Homogeneous (e.g., all Decision Trees) Lets all the models vote (classification) or averages their results (regression). Makes the model more stable and less likely to "overfit" or just memorize market noise.
Boosting Builds models one after another, where each new model tries to fix the mistakes the previous one made. To reduce bias (by fixing errors) and variance. Homogeneous (e.g., simple "weak" learners) A weighted sum, giving more power to the predictions from better models. Creates a very accurate predictor by focusing on the hard-to-predict cases.
Stacking Trains several different models, then trains a final "meta-model" that learns how to best combine their predictions. To get the best possible predictive performance by mixing models. Heterogeneous (e.g., SVM, NN, Trees) A final "meta-learner" model makes the call based on what the first-level models said. Dynamically picks the best parts of each model, making it great for complex forecasts where no single model is best.

Non-Technical Section¶

Relying on just one predictive model is like getting a diagnosis from a single doctor. Even if they're a top expert, their view is still just one perspective. A better approach is to get a "panel" of different experts, listen to all of them, and then make a final call. In data analysis, we have a formal way to do this: Ensemble Learning.

Think of the models we've built—the Linear Discriminant, the SVM, and the Neural Network—as our panel of specialists. Each one has a different way of looking at the problem:

  • The Linear Model is the "by-the-book" economist. It sticks to clear, established relationships and proven statistical rules.
  • The Support Vector Machine (SVM) is the expert pattern-finder. It's great at spotting complex, non-obvious boundaries between market "regimes" that other models might miss.
  • The Neural Network is the deep analyst. It can sift through huge amounts of information to find subtle, hidden patterns that aren't simple or linear.

So, how do we manage this team? We have three basic strategies:

  1. Taking a Poll (Bagging): This is the simplest way. We just create hundreds of versions of one type of specialist (like the linear model), giving each one a slightly different piece of the data. Then, we let them all vote. The majority decision wins. This is a great way to smooth out any one model's weird quirks and get a stable prediction that isn't just reacting to market noise (Odegua 2).

  2. The Focused Team (Boosting): This approach is all about teamwork and fixing mistakes. The first model makes a prediction. The second model then looks at the results and focuses specifically on what the first one got wrong. The third model focuses on the remaining mistakes, and so on. You end up with a highly accurate "task force" that's really good at predicting the tough cases (Freund and Schapire 1).

  3. The Expert Council with a Manager (Stacking): This is the most advanced approach, and the one we like best. We let our diverse team of specialists (Linear, SVM, and NN) each make their own prediction. Then, we bring in a "manager"—a new model called a meta-learner. The manager's only job is to learn the strengths and weaknesses of the other specialists. It's not looking at the raw market data, it's looking at how the other models behave. Based on this, it learns how to intelligently combine their opinions to make the best possible final decision. For example, it might learn to trust the SVM more in a chaotic market and the linear model in a stable one (Wolpert 2)

Recomendations

We strongly recommend using the "Expert Council" (Stacking) approach.

This strategy directly answers our biggest question: "How do we use all these models together?" It doesn't force us to pick a single "winner." Instead, it creates a smart system that uses the collective intelligence of all our models. This is the most solid way to build a strategy that can adapt, learn, and find reliable returns.

Identification of Factors That Impact the Portfolio

Adopting a stacking model has a few direct, positive impacts on performance and risk:

  • Better Predictions: By mixing the unique insights from each model, the final signal is almost always more accurate than any single model could be on its own. This should lead to better returns.
  • More Stability (Less Volatility): This is like having a "portfolio of models." If one specialist has a bad day or a specific market condition tricks it, the others are there to pick up the slack. This leads to steadier performance and a smoother ride.
  • Adapts to the Market: The "manager" model automatically learns to spot different market types (e.g., bull, bear, sideways) and gives more weight to the specialist that's best for those specific conditions. This makes the whole strategy much more resilient over time.
In [4]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import yfinance as yf
import pandas_ta as ta  # technical analysis helpers

# Machine Learning / optimization
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.base import clone # Import clone

# Ensemble Methods
from sklearn.ensemble import RandomForestClassifier # (Bagging)
from sklearn.ensemble import AdaBoostClassifier # (Boosting)
# StackingClassifier is no longer used
# from sklearn.ensemble import StackingClassifier 

# Neural Network
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2
from scikeras.wrappers import KerasClassifier

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

# -------------------------------------------------------------------------
# CONFIGURATION
# -------------------------------------------------------------------------
# Set to True for a quick test run with few iterations/estimators.
# Set to False for a full, thorough run (will take longer).
FAST_MODE = True
# -------------------------------------------------------------------------

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
tf.random.set_seed(RANDOM_STATE)

# --- Define run parameters based on FAST_MODE ---
if FAST_MODE:
    print("--- RUNNING IN FAST_MODE ---")
    nn_epochs = 10
    rf_estimators = 50
    ada_estimators = 50
else:
    print("--- RUNNING IN FULL_MODE (this will take time) ---")
    nn_epochs = 50
    rf_estimators = 200
    ada_estimators = 200

# -------------------------------------------------------------------------
# 1) DATA ACQUISITION & FEATURE ENGINEERING
# -------------------------------------------------------------------------
print("Fetching and preparing data...")
ticker = 'SPY'
data = yf.download(ticker, start='2010-01-01', end='2023-12-31', auto_adjust=False)
if isinstance(data.columns, pd.MultiIndex):
    data.columns = data.columns.droplevel(1)
data = data.reset_index()

# --- Features (same as before) ---
data_indexed = data.set_index('Date')
data['RSI_14'] = ta.rsi(data_indexed['Close'], length=14).values
macd = ta.macd(data_indexed['Close'], fast=12, slow=26, signal=9)
data['MACD'] = macd.iloc[:, 0].values
data['MACD_hist'] = macd.iloc[:, 1].values
bbands = ta.bbands(data_indexed['Close'], length=20)
data['BB_upper'] = bbands.iloc[:, 2].values
data['BB_middle'] = bbands.iloc[:, 1].values
data['BB_lower'] = bbands.iloc[:, 0].values
data['OBV'] = ta.obv(data_indexed['Close'], data_indexed['Volume']).values
data['ATR_14'] = ta.atr(data_indexed['High'], data_indexed['Low'], data_indexed['Close'], length=14).values
for lag in range(1, 6):
    data[f'return_lag_{lag}'] = data['Close'].pct_change(lag)
data['volatility_20d'] = data['Close'].pct_change().rolling(window=20).std()

# --- TARGET (CLASSIFICATION) ---
# We predict the binary direction: 1 for 'Up', 0 for 'Down' or 'Flat'
data['target'] = (data['Close'].shift(-1) > data['Close']).astype(int)

# --- Cleaning and Splitting ---
data.dropna(inplace=True)
data.reset_index(drop=True, inplace=True)

features_to_drop = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'target']
X = data.drop(columns=[col for col in features_to_drop if col in data.columns])
y = data['target']

# --- Standard 80/20 Train/Test Split ---
# We will use the *raw* (unscaled) X_train/X_test for pipelines
train_size = int(len(X) * 0.8)
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]

# --- Scaled data for non-pipeline models ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- TimeSeriesSplit for CV in Stacking ---
n_splits = 5
tscv = TimeSeriesSplit(n_splits=n_splits)

print(f"Data ready. X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")


# -------------------------------------------------------------------------
# 2) DEFINE "THE SPECIALISTS" (BASE MODELS)
# -------------------------------------------------------------------------
# We define our three diverse models as pipelines, using
# "optimal" params found in previous steps.

# Specialist 1: SVM (The Pattern-Recognizer)
# Using assumed best params from a Bayesian search
svm_params = {'C': 10.0, 'gamma': 0.01, 'kernel': 'rbf', 'probability': True}
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(**svm_params, random_state=RANDOM_STATE))
])

# Specialist 2: Neural Network (The Deep-Analyst)
# Keras model creation function
def create_nn_model(learning_rate=0.001, dropout_rate=0.2, l2_reg=0.01, num_layers=2, neurons_per_layer=64):
    model = Sequential()
    model.add(Dense(neurons_per_layer, input_dim=X.shape[1], activation='relu',
                    kernel_regularizer=l2(l2_reg)))
    model.add(Dropout(dropout_rate))
    for _ in range(num_layers - 1):
        model.add(Dense(max(4, neurons_per_layer // 2), activation='relu',
                        kernel_regularizer=l2(l2_reg)))
        model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid'))
    optimizer = Adam(learning_rate=learning_rate)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# Using assumed best params from a Bayesian search
nn_params = {
    'model__learning_rate': 0.001,
    'model__dropout_rate': 0.2,
    'model__l2_reg': 0.01,
    'model__num_layers': 2,
    'model__neurons_per_layer': 64,
    'batch_size': 32
}
nn_classifier = KerasClassifier(
    model=create_nn_model,
    epochs=nn_epochs,
    verbose=0,
    random_state=RANDOM_STATE,
    **nn_params
)
nn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', nn_classifier)
])

# Specialist 3: LDA (The Cautious Economist)
# Using assumed best params from a Grid search
lda_params = {'solver': 'lsqr', 'shrinkage': 0.1}
lda_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis(**lda_params))
])

# List of our specialist estimators for Stacking
base_estimators = [
    ('svm', svm_pipeline),
    ('nn', nn_pipeline),
    ('lda', lda_pipeline)
]

# -------------------------------------------------------------------------
# 3) IMPLEMENT ENSEMBLE STRATEGIES
# -------------------------------------------------------------------------

# Dictionary to store results
model_scores = {}

# --- Strategy 1: Bagging (The Democratic Poll) ---
print("\nTraining Strategy 1: Bagging (Random Forest)...")
rf_model = RandomForestClassifier(
    n_estimators=rf_estimators,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    max_depth=10 # Control tree depth to prevent overfitting
)
# Tree models don't strictly need scaling, but we use scaled data
# for a fair comparison with other scaled, non-pipelined models.
rf_model.fit(X_train_scaled, y_train)
y_pred_rf = rf_model.predict(X_test_scaled)
rf_accuracy = accuracy_score(y_test, y_pred_rf)
model_scores['Bagging (RandomForest)'] = rf_accuracy
print(f"Random Forest Test Accuracy: {rf_accuracy:.4f}")
# print(classification_report(y_test, y_pred_rf))


# --- Strategy 2: Boosting (The Focused Task Force) ---
print("\nTraining Strategy 2: Boosting (AdaBoost)...")
ada_model = AdaBoostClassifier(
    n_estimators=ada_estimators,
    random_state=RANDOM_STATE,
    learning_rate=0.1
)
ada_model.fit(X_train_scaled, y_train)
y_pred_ada = ada_model.predict(X_test_scaled)
ada_accuracy = accuracy_score(y_test, y_pred_ada)
model_scores['Boosting (AdaBoost)'] = ada_accuracy
print(f"AdaBoost Test Accuracy: {ada_accuracy:.4f}")
# print(classification_report(y_test, y_pred_ada))


# --- Strategy 3: Stacking (The Expert Council) - MANUAL IMPLEMENTATION ---
print("\nTraining Strategy 3: Stacking (Manual Time-Series Method)...")
# The "Chairperson" (meta-learner)
meta_learner = LogisticRegression(C=1.0, random_state=RANDOM_STATE, n_jobs=-1)

# Lists to store out-of-fold predictions (meta-features) and corresponding targets
meta_X_train_list = []
meta_y_train_list = []

print("Generating out-of-fold predictions for meta-learner...")
# Loop through each TimeSeriesSplit fold
for fold_n, (train_idx, val_idx) in enumerate(tscv.split(X_train)):
    print(f"  ...Processing Fold {fold_n+1}/{n_splits}")
    X_fold_train, X_fold_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
    y_fold_train, y_fold_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

    # Store predictions for this fold
    fold_meta_features = []
    
    # Train each base model and get predictions on the validation set
    for name, model_pipeline in base_estimators:
        cloned_model = clone(model_pipeline)
        cloned_model.fit(X_fold_train, y_fold_train)
        
        # Use predict_proba for more info (probability of class 1)
        # Ensure model has predict_proba (like LDA) or handle it
        if hasattr(cloned_model, "predict_proba"):
            preds = cloned_model.predict_proba(X_fold_val)[:, 1]
        else: # Fallback for models like base SVM without probability=True
            preds = cloned_model.predict(X_fold_val)
            
        fold_meta_features.append(preds)

    # Add this fold's predictions (meta-features) to the list
    meta_X_train_list.append(np.column_stack(fold_meta_features))
    # Add this fold's true targets to the list
    meta_y_train_list.append(y_fold_val)

# Concatenate all out-of-fold predictions to create the meta-learner's training set
Z_train = np.concatenate(meta_X_train_list)
y_meta_train = np.concatenate(meta_y_train_list)

print(f"\nMeta-learner training on {Z_train.shape[0]} out-of-fold samples.")

# Train the "Chairperson" (meta-learner) on the out-of-fold predictions
meta_learner.fit(Z_train, y_meta_train)

# --- Generate predictions for the final X_test ---
# 1. Train all base models on the *entire* X_train dataset
# 2. Generate predictions (meta-features) on X_test
# 3. Feed these meta-features to the *trained* meta_learner

print("Training base models on full data for final test prediction...")
Z_test_list = []
for name, model_pipeline in base_estimators:
    # We must re-fit the models on the *full* training data
    cloned_model = clone(model_pipeline)
    cloned_model.fit(X_train, y_train)
    
    if hasattr(cloned_model, "predict_proba"):
        preds = cloned_model.predict_proba(X_test)[:, 1]
    else:
        preds = cloned_model.predict(X_test)
        
    Z_test_list.append(preds)

# Create the meta-feature matrix for the test set
Z_test = np.column_stack(Z_test_list)

print("Generating final stacking predictions...")
# Get the final prediction from the meta-learner
y_pred_stack = meta_learner.predict(Z_test)
stack_accuracy = accuracy_score(y_test, y_pred_stack)
model_scores['Stacking (Manual)'] = stack_accuracy # Updated name
print(f"Stacking Test Accuracy: {stack_accuracy:.4f}")

print("\nStacking Model Classification Report (Test Set):")
print(classification_report(y_test, y_pred_stack))

# Confusion matrix for the recommended model
cm = confusion_matrix(y_test, y_pred_stack)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Down/Flat', 'Up'],
            yticklabels=['Down/Flat', 'Up'])
plt.title('Figure 7: Confusion Matrix (Stacking Model)')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

# -------------------------------------------------------------------------
# 4) FINAL COMPARISON
# -------------------------------------------------------------------------
print("\nTraining individual 'specialists' for comparison...")

# Train and score each base model individually
# We fit on *raw* X_train, as they are pipelines
svm_pipeline.fit(X_train, y_train)
model_scores['SVM (Specialist)'] = accuracy_score(y_test, svm_pipeline.predict(X_test))
print(f"Individual SVM Accuracy: {model_scores['SVM (Specialist)']:.4f}")

lda_pipeline.fit(X_train, y_train)
model_scores['LDA (Specialist)'] = accuracy_score(y_test, lda_pipeline.predict(X_test))
print(f"Individual LDA Accuracy: {model_scores['LDA (Specialist)']:.4f}")

# The NN pipeline takes a bit longer
nn_pipeline.fit(X_train, y_train)
model_scores['NN (Specialist)'] = accuracy_score(y_test, nn_pipeline.predict(X_test))
print(f"Individual NN Accuracy: {model_scores['NN (Specialist)']:.4f}")


# --- Plot Comparison Chart ---
comparison_df = pd.DataFrame(list(model_scores.items()), columns=['Model', 'Test Accuracy'])
comparison_df = comparison_df.sort_values('Test Accuracy', ascending=False)

print("\n--- Final Model Comparison ---")
print(comparison_df.to_string(index=False))

plt.figure(figsize=(10, 6))
bars = sns.barplot(
    data=comparison_df,
    x='Test Accuracy',
    y='Model',
    palette='viridis'
)
plt.title('Figure 8: Ensemble vs. Individual Model Performance')
plt.xlabel('Test Set Accuracy')
plt.ylabel('Model')
plt.xlim(min(model_scores.values()) * 0.95, max(model_scores.values()) * 1.05)

# Add text labels
for bar in bars.patches:
    bars.annotate(
        f'{bar.get_width():.4f}',
        (bar.get_width(), bar.get_y() + bar.get_height() / 2),
        ha='left', va='center',
        size=10, xytext=(5, 0),
        textcoords='offset points'
    )

plt.tight_layout()
plt.show()
[*********************100%***********************]  1 of 1 completed
--- RUNNING IN FAST_MODE ---
Fetching and preparing data...
Data ready. X_train shape: (2791, 14), X_test shape: (698, 14)

Training Strategy 1: Bagging (Random Forest)...
Random Forest Test Accuracy: 0.4756

Training Strategy 2: Boosting (AdaBoost)...

AdaBoost Test Accuracy: 0.5201

Training Strategy 3: Stacking (Manual Time-Series Method)...
Generating out-of-fold predictions for meta-learner...
  ...Processing Fold 1/5
2025-10-20 22:35:05.520299: E tensorflow/core/framework/node_def_util.cc:680] NodeDef mentions attribute use_unbounded_threadpool which is not in the op definition: Op<name=MapDataset; signature=input_dataset:variant, other_arguments: -> handle:variant; attr=f:func; attr=Targuments:list(type),min=0; attr=output_types:list(type),min=1; attr=output_shapes:list(shape),min=1; attr=use_inter_op_parallelism:bool,default=true; attr=preserve_cardinality:bool,default=false; attr=force_synchronous:bool,default=false; attr=metadata:string,default=""> This may be expected if your graph generating binary is newer  than this binary. Unknown attributes will be ignored. NodeDef: {{node ParallelMapDatasetV2/_15}}
  ...Processing Fold 2/5
  ...Processing Fold 3/5
  ...Processing Fold 4/5
  ...Processing Fold 5/5
2025-10-20 22:35:12.407510: E tensorflow/core/framework/node_def_util.cc:680] NodeDef mentions attribute use_unbounded_threadpool which is not in the op definition: Op<name=MapDataset; signature=input_dataset:variant, other_arguments: -> handle:variant; attr=f:func; attr=Targuments:list(type),min=0; attr=output_types:list(type),min=1; attr=output_shapes:list(shape),min=1; attr=use_inter_op_parallelism:bool,default=true; attr=preserve_cardinality:bool,default=false; attr=force_synchronous:bool,default=false; attr=metadata:string,default=""> This may be expected if your graph generating binary is newer  than this binary. Unknown attributes will be ignored. NodeDef: {{node ParallelMapDatasetV2/_15}}
Meta-learner training on 2325 out-of-fold samples.
Training base models on full data for final test prediction...
Generating final stacking predictions...
Stacking Test Accuracy: 0.5201

Stacking Model Classification Report (Test Set):
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       335
           1       0.52      1.00      0.68       363

    accuracy                           0.52       698
   macro avg       0.26      0.50      0.34       698
weighted avg       0.27      0.52      0.36       698

No description has been provided for this image
Training individual 'specialists' for comparison...
Individual SVM Accuracy: 0.5244
Individual LDA Accuracy: 0.5186
2025-10-20 22:35:17.634494: E tensorflow/core/framework/node_def_util.cc:680] NodeDef mentions attribute use_unbounded_threadpool which is not in the op definition: Op<name=MapDataset; signature=input_dataset:variant, other_arguments: -> handle:variant; attr=f:func; attr=Targuments:list(type),min=0; attr=output_types:list(type),min=1; attr=output_shapes:list(shape),min=1; attr=use_inter_op_parallelism:bool,default=true; attr=preserve_cardinality:bool,default=false; attr=force_synchronous:bool,default=false; attr=metadata:string,default=""> This may be expected if your graph generating binary is newer  than this binary. Unknown attributes will be ignored. NodeDef: {{node ParallelMapDatasetV2/_15}}
Individual NN Accuracy: 0.5201

--- Final Model Comparison ---
                 Model  Test Accuracy
      SVM (Specialist)       0.524355
   Boosting (AdaBoost)       0.520057
       NN (Specialist)       0.520057
     Stacking (Manual)       0.520057
      LDA (Specialist)       0.518625
Bagging (RandomForest)       0.475645
No description has been provided for this image

Interpretation

We also analyzed ensemble learning methods, applying Bagging, Boosting, and Stacking to the SPY stock data to see how they'd perform.

The results are summarized in Figure 8 (Ensemble vs. Individual Model Performance). What's interesting here is that a single "specialist" model, the SVM, actually achieved the highest test set accuracy at 0.5244. This performance narrowly beat the top ensemble methods. Both Boosting (AdaBoost) and our Stacking model tied with the specialist Neural Network at 0.5201.

However, Figure 7 (Confusion Matrix for the Stacking Model) shows a critical flaw in that model's performance. It only achieved its 52% accuracy because it developed a complete bias and just predicted "Up" for every single sample. As a result, it incorrectly classified all 335 "Down/Flat" days.

Peer Reviews¶

Reviewer Author Review of Work (Issue) What Was Great Suggestions for Improvement
Student A Student C Issue 3: Ensemble Learning • Awesome Analogies: The "Expert Council" metaphor makes stacking super easy to understand.
• Covered All the Bases: You explained Bagging, Boosting, and Stacking really well, with all the right details.
• Clear Recommendation: The reasoning for picking Stacking makes total sense and answers the main question perfectly.
1. Tiny Formatting Tweak: In the Boosting math, maybe use $\exp$ in LaTeX? Just for consistency.
2. One More Detail: It might be good to add a note that the final models in stacking get retrained on all the data before you use them.
Student B Student A Issue 1: Optimizing Hyperparameters • Good Financial Focus: I liked how you kept bringing it back to temporal validation and finance-specific settings.
• Clear Process: The "Strategic, Intelligent, Validation" framework was a really clear and smart way to explain it.
• Helpful Summary: That comparison table at the end is a great, quick reference.
1. Connecting the Dots: Maybe try to link the first abstract formula more directly to how cross-validation actually uses it.
2. Readability: For the "Critical Questions" part, you could use bullet points to make it a bit faster to read.
Student C Student B Issue 2: Bias-Variance Tradeoff • Really Deep Dive: Loved how you connected the classic theory to newer stuff like "Double Descent."
• Great Framing: The "Two Types of Strategy Failure" idea was a perfect way to explain under/overfitting.
• Spot-on Analogy: Describing regularization as "portfolio construction for factors" was perfect for a finance crowd.
A Small Structural Idea: Maybe add a quick, simple summary of "double descent" to the non-technical part? It would help prep the strategists for the more advanced idea later.

Marketing Alpha¶

The problem, in a data-chocked market today, is not that we don’t have the information; but that we fail to interpret it. While these models are subject to overfitting (breaking point of live trading) and underfitting (not capturing complex market dynamics), traditional quantitative models invariably fail.

We start out with a more or less hand-picked, well-motivated and complimentary toolkit where each model is designed to attack a specific problem in financial forecasting:

Stage 1

However, if we are to trust a model’s output, then we must be confident that the model has been calibrated correctly for the specific and non-stationary nature of financial markets. We accomplish this through:

- Hyperparameter Optimization (Tuning the Engine): Default settings are not used. Every single model is carefully optimized thanks to advanced search algorithms (such as Bayesian Optimization) in order to get the best hyper parameters. Think of this as akin to tweaking the settings of a high-performance engine: You reconfigure parameters so that power and efficiency are optimized.

  • Financial Reality Validation (Real-World Stress Testing) A model that works on a static backtest is worthless. We evaluate all of these configurations using Fired Temporal Cross-Validation which honors the forward flow of time (training from the past, firing to predict on the future) though keeping real world deployment in mind.

  • Mastering the Bias-Variance Tradeoff (Avoiding the Two Traps): Our basic risk-management principle is to fine tune a trade-off between overly simple model (High Bias) and an overly complex one (High Variance). By regularizing (L1/Lasso to select signals, L2/Ridge for stability), we are making our models focus on the strongest (in terms of robustness and economic intuition) relationships and not just chase noise.

Stage 2

We adopt a diversified set of specialist models. Each is responsible for answering a different kind of question, so that we can suitably choose the variety of tool every time.

  • For Regime Detection (K-Means Clustering)

This unsupervised model acts as our "market regime detector." Through examining market data, it can automatically identify firm structures not visible to the human eye, such discontinuous environment "high-volatility/risk-off" versus "low-volatility/risk-on";and it groups thousands of assets into distinct behavioral clusters. This allows strategy to be tailored for the current 'market state'.

  • For Interpretable Strategy (Classification Trees)

This model serves as our "transparent strategist." It produces plain and simple, human- readable if-then rules (for example, "IF P/E > 25 AND Momentum < 0,THEN signal = Sell"). This provides a level of transparency that has not previously existed and allows fund managers the chance to really know why a model makes the decisions it does – essential for winning trust and letting humans exercise supervision.

  • For Robust Factor Selection (Elastic Net Regression)

This model is the best of both worlds for selecting predictive factors. It automatically conducts variable selection like Lasso (zeros out noisy factors) and also deals with groups of correlated factors like Ridge (keeps estimates from becoming wild). The result is a model which is clean, robust, and interpretable.

  • For Complex Pattern Recognition (Support Vector Machines)

This model is our expert pattern- recognizer finding those crossing points that are not obvious, adept at identifying complex patterns that other models miss. It can, for instance, trace a fine line separating "healthy" growth stocks from "value traps."

  • For Deep Dynamic Analysis (Neural Networks)

This is our deep-learning analyst, capable of absorbing vast amounts of information and coming out with a subtle, non- linear view of the dynamics within and between factors. This is different from all other methods.

Stage 3

Our final and most powerful step is to realize that no single model, or "specialist," is perfection itself. To rely solely on one model is a critical single point of weakness.

It is with this in mind that we make use of Stacking (Stacked Generalization), a cutting- edge ensemble method which we call our "Expert Council."

  • Level 0 (The Specialists): First of all we train our diverse team of specialists (Elastic Net, SVM,NN, etc.) on raw market data.

  • Level 1 (The Chairperson): We then train a "Chairperson" (a meta- learner model ). Its one and only job is not to look at the market, but to learn the strengths and weaknesses of each specialist.

Conclusion¶

This study applied hyperparameter optimization, bias-variance tradeoff analysis, and ensemble learning to financial time-series data. The empirical results validated key theoretical concepts, including the classic U-shaped bias-variance curve, the minimization of test error through L1/L2 regularization, and the modern "double descent" phenomenon in over-parameterized neural networks.

Despite these theoretical confirmations, building a robust classifier for SPY stock data proved unsuccessful. Hyperparameter optimization selected a top-performing SVM model with a cross-validation score of 0.5185, yet its final test accuracy of 54.4% was compromised by a significant predictive bias. This challenge persisted across all ensemble methods, which produced counter-intuitive outcomes. The sophisticated Stacking model failed entirely, achieving a 52.0% accuracy by predicting "Up" 100% of the time. A single SVM model performed marginally better at 52.4% accuracy than both Boosting and Stacking, which each achieved 52.0%. The uniform failure of every model to correctly predict "Down/Flat" days indicates that the feature set, which consisted of common technical indicators, did not capture meaningful signals for forecasting non-positive returns during the test period.

References¶

Belkin, Mikhail, et al. "Reconciling Modern Machine Learning and the Bias-Variance Trade-off." Proceedings of the National Academy of Sciences, vol. 116, no. 32, 2019, pp. 15849-54.

Bergstra, James, and Yoshua Bengio. "Random Search for Hyper-Parameter Optimization." Journal of Machine Learning Research, vol. 13, 2012, pp. 281-305.

Breiman, Leo. "Bagging Predictors." Machine Learning, vol. 24, no. 2, 1996, pp. 123-40.

Cortes, Corinna, and Vladimir Vapnik. "Support-Vector Networks." Machine Learning, vol. 20, 1995, pp. 273-97.

Dar, Yehuda, et al. "A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning." arXiv, 2021, arXiv:2109.02355.

Fischer, Thomas, and Christopher Krauss. "Deep Learning with Long Short-Term Memory Networks for Financial Market Predictions." European Journal of Operational Research, vol. 270, no. 2, 2018, pp. 654-69.

Frazier, Peter I. "A Tutorial on Bayesian Optimization." arXiv, 2018, arXiv:1807.02811.

Freund, Yoav, and Robert E. Schapire. "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting." Journal of Computer and System Sciences, vol. 55, no. 1, 1997, pp. 119-39.

Friedman, Jerome H. "Regularized Discriminant Analysis." Journal of the American Statistical Association, vol. 84, no. 405, 1989, pp. 165-75.

Goodfellow, Ian, et al. Deep Learning. MIT Press, 2016.

Hastie, Trevor, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed., Springer, 2009.

James, Gareth, et al. An Introduction to Statistical Learning. Springer, 2021.

Jemai, Jaber, and Ali Daud. "Bias–Variance Tradeoff Decomposition-Based Machine Learning Model Selection: Application to Credit Risk Analysis." International Journal of Data Science and Analytics, 2025, pp. 1-14.

Odegua, Richard. "An Empirical Study of Ensemble Techniques (Bagging, Boosting, and Stacking)." Proceedings of the 2019 International Conference on Data Science and Communication, 2019, pp. 1-5.

Snoek, Jasper, et al. "Practical Bayesian Optimization of Machine Learning Algorithms." Advances in Neural Information Processing Systems, vol. 25, 2012, pp. 2951-59.

Tay, Francis E.H., and Lijuan Cao. "Application of Support Vector Machines in Financial Time Series Forecasting." Omega, vol. 29, no. 4, 2001, pp. 309-17.

Timmermann, Allan. "Forecasting Methods in Finance." Annual Review of Financial Economics, vol. 10, 2018, pp. 449-79.

Wolpert, David H. "Stacked Generalization." Neural Networks, vol. 5, no. 2, 1992, pp. 241-59.