Introduction¶

Financial markets are a mess. They are noisy, chaotic, and refuse to stand still, which is exactly why standard statistical models usually fail to capture the non-linear patterns driving prices. We wanted to see if Deep Learning could handle this chaos better. We picked Long Short-Term Memory (LSTM) networks for the job simply because they handle sequential data well and can "remember" trends over long periods—something a basic regression model just can't do.

Our prediction target was the 25-day cumulative return across five asset classes: Equities (SPY), Fixed Income (TLT), Cash (SHY), Gold (GLD), and Crude Oil (DBO). We set up the experiment to test two approaches. First, we isolated each asset, training individual models to learn their specific quirks. Second, we threw them all into a single "multi-output" model to see if the network could figure out the correlations between them. But we didn’t just want to lower the error rate; we also ran an active Long/Short trading strategy to see if the predictions could actually turn a profit against a standard buy-and-hold benchmark.

Step 1¶

We started by grabbing historical data for our five ETFs (SPY, TLT, SHY, GLD, DBO) using yfinance. To keep the model from choking on gaps, we forward-filled any missing values.

Raw prices are useless here because they aren't stationary, and neural nets hate that. So we calculated logarithmic returns, $r_t = \ln(P_t / P_{t-1})$. This makes the data statistically usable, but there is a downside: standard differencing kills the long-term trends we are trying to find (Prado 78). To get around this, we used Fractional Differentiation. Instead of a simple "today minus yesterday" calculation, we found a specific fractional order $d$ for each asset. We verified this with the Augmented Dickey-Fuller (ADF) test ($p < 0.05$) to ensure we had stationary data that still kept the maximum amount of "memory."
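To make the "memory" trade-off concrete, here is a minimal sketch of the weight recursion behind fixed-width fractional differencing (the name `ffd_weights` is ours; the notebook's `get_weights_ffd` below computes the same weights, then reverses and reshapes them for a dot product). At $d = 1$ the weights collapse to plain differencing; at a fractional $d$ a long, slowly decaying tail of weights survives.

```python
import numpy as np

def ffd_weights(d, thres=1e-4):
    """Fixed-width-window fractional-difference weights:
    w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k, truncated once |w_k| < thres."""
    w, k = [1.0], 1
    while True:
        w_k = -w[-1] / k * (d - k + 1)
        if abs(w_k) < thres:
            break
        w.append(w_k)
        k += 1
    return np.array(w)

# d = 1 recovers plain differencing ("today minus yesterday"): weights [1, -1]
print(ffd_weights(1.0))   # [ 1. -1.]

# A fractional d keeps a long, slowly decaying tail of weights -> "memory"
w = ffd_weights(0.4)
print(len(w), w[:4])
```

The decay rate of the tail is what the per-asset search over $d$ is tuning: small enough to keep the ADF test happy, large enough to keep the trend.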

We also took a look at how these assets actually behave. Normalizing the prices showed the usual trade-off: stocks grew the most but crashed the hardest, while gold and cash stayed safer. We also checked the rolling correlation between stocks and bonds and found it’s incredibly unstable, which confirms why we need a non-linear model like an LSTM. Plus, every asset we looked at had "fat tails" (excess kurtosis above 3, where a normal distribution scores 0): extreme events happen way more often than they should. That’s why we flagged Huber Loss instead of plain Mean Squared Error, since it's less sensitive to those random outliers. Finally, the autocorrelation suggested an input window of 25 to 60 days to really capture the volatility.
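A quick numeric sketch of why Huber loss is the safer choice under fat tails (the `delta = 1.0` threshold here is an assumption, matching the common default): the penalty grows quadratically for small errors but only linearly past `delta`, so a single shock can't dominate the gradient.

```python
import numpy as np

def half_squared_loss(err):
    # MSE-style penalty (with the conventional 1/2): outliers dominate
    return 0.5 * err**2

def huber_loss(err, delta=1.0):
    # Quadratic near zero, linear beyond delta -> bounded gradient on outliers
    a = np.abs(err)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

errors = np.array([0.1, 0.5, 5.0])   # the 5.0 is a "fat-tail" shock
print(half_squared_loss(errors))      # the shock costs 12.5
print(huber_loss(errors))             # the shock costs only 4.5
```

For the two small errors the losses are identical; only the outlier's penalty is clipped.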

In [2]:
!pip install yfinance pandas numpy matplotlib seaborn statsmodels scipy --quiet

import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import sys
import warnings
warnings.filterwarnings("ignore")

#
np.random.seed(42)
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (14, 8)
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 10

# Assets
ASSETS = {
    "SPY": "Equity (S&P 500)",
    "TLT": "Fixed Income (20+Y Treasury)",
    "SHY": "Cash-like (1-3Y Treasury)",
    "GLD": "Precious Metals (Gold)",
    "DBO": "Crude Oil"
}
TICKERS = list(ASSETS.keys())

# Dates
START_DATE = "2010-01-01"
END_DATE = "2022-12-30"

# Data Loading
print(f"Downloading data for {TICKERS}...")
raw_data = yf.download(TICKERS, start=START_DATE, end=END_DATE, auto_adjust=False, progress=False)

try:
    if 'Adj Close' in raw_data.columns:
        prices = raw_data['Adj Close']
    elif 'Close' in raw_data.columns:
        print("Note: Using 'Close' column.")
        prices = raw_data['Close']
    else:
        prices = raw_data.xs('Adj Close', level=0, axis=1)

    prices = prices.ffill().dropna()

    if prices.empty: raise ValueError("Data Empty.")
    print(f"Data Aligned & Loaded. Shape: {prices.shape}")

except Exception as e:
    sys.exit(f"CRITICAL DATA ERROR: {e}")

# Log Returns (Standard Differentiation, d=1)
log_returns = np.log(prices / prices.shift(1)).dropna()

# Squared Returns (Volatility)
squared_returns = log_returns ** 2

# Rolling Correlation
rolling_corr_spy_tlt = log_returns['SPY'].rolling(window=60).corr(log_returns['TLT']).dropna()

# Fractional Differentiation (Preserving Memory)
def get_weights_ffd(d, thres=1e-4):
    w, k = [1.], 1
    while True:
        w_k = -w[-1] / k * (d - k + 1)
        if abs(w_k) < thres: break
        w.append(w_k)
        k += 1
    return np.array(w[::-1]).reshape(-1, 1)

def frac_diff_ffd(series, d, thres=1e-4):
    w = get_weights_ffd(d, thres)
    width = len(w) - 1
    if width >= series.shape[0]: return pd.DataFrame(np.nan, index=series.index, columns=series.columns)
    df = {}
    series_val = series.ffill().dropna()
    for name in series_val.columns:
        series_f = series_val[[name]]
        df_ = pd.Series(0.0, index=series_f.index)
        for i in range(width, series_f.shape[0]):
            val = series_f.iloc[i-width:i+1].values
            if val.shape[0] == w.shape[0]:
                df_.iloc[i] = np.dot(w.T, val)[0,0]
            else:
                df_.iloc[i] = np.nan
        df[name] = df_
    return pd.DataFrame(df).iloc[width:]


frac_diffs = pd.DataFrame(index=prices.index)
d_values = {}
for ticker in TICKERS:
    best_d = 1.0
    for d in np.linspace(0.1, 1.0, 10):
        try:
            temp = frac_diff_ffd(prices[[ticker]], d=d, thres=1e-4).dropna()
            # adfuller expects a 1-D series, not a (n, 1) DataFrame
            if len(temp) > 50 and adfuller(temp.iloc[:, 0])[1] < 0.05:
                best_d = d
                break
        except Exception:
            continue
    d_values[ticker] = best_d
    frac_diffs = frac_diffs.join(frac_diff_ffd(prices[[ticker]], d=best_d, thres=1e-4), how='outer')

#

# Price Dynamics
fig1, axes = plt.subplots(1, 2, figsize=(18, 7))

# Raw Prices
for col in prices.columns:
    axes[0].plot(prices.index, prices[col], label=col, linewidth=1.5)
axes[0].set_title("Figure 1: Raw Prices (Before Normalization)", fontweight='bold')
axes[0].set_ylabel("Price ($)")
axes[0].legend(loc="upper left")
axes[0].grid(True, alpha=0.3)

# Normalized Prices
norm_prices = prices / prices.iloc[0]
for col in norm_prices.columns:
    axes[1].plot(norm_prices.index, norm_prices[col], label=col, linewidth=1.5)
axes[1].set_title("Figure 1.1: Wealth Dynamics (Normalized to 1.0)", fontweight='bold')
axes[1].set_ylabel("Cumulative Return")
axes[1].axvline(pd.Timestamp('2018-01-01'), color='r', linestyle='--', label='Test Start')
axes[1].legend(loc="upper left")
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Regime Changes & Correlation Matrix
fig2, axes = plt.subplots(1, 2, figsize=(18, 6))

# Dynamic Rolling Correlation
axes[0].plot(rolling_corr_spy_tlt.index, rolling_corr_spy_tlt, color='darkblue', linewidth=1)
axes[0].axhline(0, color='black', linestyle='--', alpha=0.5)
axes[0].fill_between(rolling_corr_spy_tlt.index, rolling_corr_spy_tlt, 0, where=(rolling_corr_spy_tlt > 0), color='red', alpha=0.1)
axes[0].fill_between(rolling_corr_spy_tlt.index, rolling_corr_spy_tlt, 0, where=(rolling_corr_spy_tlt < 0), color='green', alpha=0.1)
axes[0].set_title("Figure 1.2: 60-Day Rolling Correlation (SPY vs TLT)", fontweight='bold')

# Static Correlation Matrix
corr_matrix = log_returns.corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='coolwarm', center=0, square=True, ax=axes[1])
axes[1].set_title("Figure 1.3: Correlation Matrix", fontweight='bold')

plt.tight_layout()
plt.show()

# Fat Tail Analysis
fig3, axes = plt.subplots(2, 3, figsize=(18, 10))
axes_flat = axes.flatten()

for i, ticker in enumerate(TICKERS):
    ax = axes_flat[i]
    sns.histplot(log_returns[ticker], stat="density", kde=False, ax=ax, color='skyblue', alpha=0.6)
    mu, std = stats.norm.fit(log_returns[ticker])
    x = np.linspace(ax.get_xlim()[0], ax.get_xlim()[1], 100)
    ax.plot(x, stats.norm.pdf(x, mu, std), 'r--', linewidth=2, label='Normal')

    kurt = log_returns[ticker].kurtosis()
    ax.set_title(f"{ticker} (Kurtosis={kurt:.2f})")
    ax.legend()
    if kurt > 3.0: ax.text(0.05, 0.9, "FAT TAIL", transform=ax.transAxes, color='red', fontweight='bold')

# Combined KDE
ax_final = axes_flat[5]
for ticker in TICKERS:
    sns.kdeplot(log_returns[ticker], ax=ax_final, fill=True, alpha=0.1, linewidth=1.5, label=ticker)
ax_final.set_title("Figure 1.4: Combined Density (KDE)")
ax_final.legend()
plt.tight_layout()
plt.show()

#
fig4, axes = plt.subplots(5, 2, figsize=(14, 15))
cols = ["ACF (Standard d=1)", "PACF (Standard d=1)"]

for i, col_name in enumerate(cols):
    axes[0, i].set_title(col_name, fontweight='bold', fontsize=14)

for i, ticker in enumerate(TICKERS):
    # Data
    std_data = log_returns[ticker]

    # Col 0: ACF Standard
    plot_acf(std_data, ax=axes[i, 0], lags=20, alpha=0.05, title="", auto_ylims=True)
    axes[i, 0].set_ylabel(ticker, fontweight='bold', fontsize=12)

    # Col 1: PACF Standard
    plot_pacf(std_data, ax=axes[i, 1], lags=20, alpha=0.05, title="", auto_ylims=True)

plt.suptitle("Serial Correlation: Standard Log Returns", y=1.01, fontsize=18, fontweight='bold')
plt.tight_layout()
plt.show()
#

# FIGURE 5: Stationarity Series Plot
fig5, axes = plt.subplots(2, 3, figsize=(18, 10))
axes_flat = axes.flatten()
for i, ticker in enumerate(TICKERS):
    ax = axes_flat[i]
    series = frac_diffs[ticker].dropna()
    ax.plot(series.index, series.values, color='purple', linewidth=0.8)
    ax.set_title(f"{ticker} Frac Diff Series (d={d_values[ticker]:.2f})", fontweight='bold')
axes_flat[5].axis('off')
plt.tight_layout()
plt.show()

#
adf_res = []
for ticker in TICKERS:
    res = adfuller(log_returns[ticker])
    adf_res.append({'Asset': ticker, 'P-Value': res[1], 'Result': "STATIONARY" if res[1]<0.05 else "FAIL"})
print(pd.DataFrame(adf_res))


stats_df = log_returns.describe().T
stats_df['Kurtosis'] = log_returns.kurtosis()  # pandas reports excess kurtosis (normal = 0)
stats_df['Loss Function'] = np.where(stats_df['Kurtosis'] > 3.0, "Huber Loss", "MSE")
print(stats_df[['Kurtosis', 'Loss Function']])

print("\n--- 3. OPTIMAL WINDOW SIZE (Volatility Memory) ---")
window_suggestions = {}
for ticker in TICKERS:
    pacf_vals, confint = pacf(squared_returns[ticker], nlags=40, alpha=0.05)
    sig_lags = np.where(np.abs(pacf_vals) > 0.1)[0]
    sig_lags = sig_lags[sig_lags > 0]
    window_suggestions[ticker] = sig_lags[-1] if len(sig_lags) > 0 else 1
print(f"Recommended Window: {max(window_suggestions.values())} days")
Downloading data for ['SPY', 'TLT', 'SHY', 'GLD', 'DBO']...
Data Aligned & Loaded. Shape: (3271, 5)
  Asset       P-Value      Result
0   SPY  4.904129e-23  STATIONARY
1   TLT  8.364964e-19  STATIONARY
2   SHY  6.711575e-22  STATIONARY
3   GLD  0.000000e+00  STATIONARY
4   DBO  0.000000e+00  STATIONARY
         Kurtosis Loss Function
Ticker                         
DBO      5.988149    Huber Loss
GLD      5.155523    Huber Loss
SHY      8.747785    Huber Loss
SPY     11.555626    Huber Loss
TLT      4.147508    Huber Loss

--- 3. OPTIMAL WINDOW SIZE (Volatility Memory) ---
Recommended Window: 30 days

Results¶

Take a look at Figure 1.1 for the normalized history. You can see the S&P 500 tearing upward, only tripping in 2020 and 2022. The worrying part is Fixed Income (TLT). It’s supposed to be the safe hedge, but in 2022, it crashed right alongside stocks.

Figure 1.2 captures that regime change perfectly. The correlation flips from negative (green) to positive spikes (red), proving that when inflation hits, diversification stops working. Figure 1.3 backs this up, identifying Crude Oil (DBO) and Gold (GLD) as the only assets that actually stayed uncorrelated when the stock-bond link snapped.

The histograms in Figure 1.4 show those "fat tails" clearly. The autocorrelation plots also justify our use of Fractional Differentiation: standard differencing turns the market signal into near-random noise, but our fractional method kept the patterns intact, so the model actually had something to learn.

Step 2¶

For the second phase, we built the actual model. We wanted to forecast performance 25 days out. The target was the sum of daily log returns:

$$y_t = \sum_{i=1}^{25} \ln\left(\frac{P_{t+i}}{P_{t+i-1}}\right)$$
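This rolling-sum target can be sanity-checked on a toy price path (all names here are illustrative, not the notebook's variables): summing 25 daily log returns and shifting the window back should telescope to $\ln(P_{t+25}/P_t)$.

```python
import numpy as np
import pandas as pd

horizon = 25
rng = np.random.default_rng(0)
# Toy geometric random-walk price path
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 300))))

log_returns = np.log(prices / prices.shift(1))
# Target construction used in the notebook: forward 25-day cumulative log return
target = log_returns.rolling(horizon).sum().shift(-horizon)

# Telescoping check: the sum of r_{t+1..t+25} is ln(P_{t+25} / P_t)
direct = np.log(prices.shift(-horizon) / prices)
print(np.allclose(target.dropna(), direct.dropna()))  # True
```

The `shift(-horizon)` is what makes the label strictly forward-looking, so no future information leaks into the features.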

We fed the model our fractionally differentiated prices and volatility data, plus the Relative Strength Index (RSI) to give it a sense of momentum (calculated over 14 days):

$$RSI = 100 - \frac{100}{1 + \frac{U}{D}}$$
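As a sanity check on the simple-moving-average RSI form used here (Wilder's original indicator uses exponential smoothing instead, so values differ slightly), a strictly rising price series should pin the indicator at 100:

```python
import numpy as np
import pandas as pd

def rsi(series, period=14):
    # Simple-moving-average RSI, matching the rolling-mean form in the notebook
    delta = series.diff()
    gain = delta.where(delta > 0, 0).rolling(period).mean()   # U: average gains
    loss = (-delta.where(delta < 0, 0)).rolling(period).mean()  # D: average losses
    rs = gain / loss
    return 100 - 100 / (1 + rs)

up_only = pd.Series(np.arange(1.0, 41.0))   # strictly rising prices
print(rsi(up_only).iloc[-1])                 # 100.0: all gains, no losses
```

Symmetrically, a strictly falling series pins RSI at 0, which is why the notebook backfills the warm-up NaNs with the neutral value 50.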

We went with an LSTM because of its gate structure. It uses three gates—forget ($f_t$), input ($i_t$), and output ($o_t$)—to decide what info stays in the cell state ($C_t$) and what goes to the hidden state ($h_t$) (Zhang 2023):

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$ $$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$ $$C_t = f_t \cdot C_{t-1} + i_t \cdot \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$ $$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$ $$h_t = o_t \cdot \tanh(C_t)$$
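The gate equations above can be walked through by hand in NumPy for a single cell (the toy sizes and random weights are ours, not the trained model's). Note how the sigmoid output gate and the $\tanh$ squashing keep the hidden state bounded in $(-1, 1)$:

```python
import numpy as np

rng = np.random.default_rng(42)
n_in, n_hid = 3, 4   # toy sizes: 3 input features, 4 hidden units

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on the concatenation [h_{t-1}, x_t]
W = {g: rng.normal(0, 0.1, (n_hid, n_hid + n_in)) for g in "fico"}
b = {g: np.zeros(n_hid) for g in "fico"}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])                    # forget gate
    i = sigmoid(W["i"] @ z + b["i"])                    # input gate
    c = f * c_prev + i * np.tanh(W["c"] @ z + b["c"])   # cell state update
    o = sigmoid(W["o"] @ z + b["o"])                    # output gate
    h = o * np.tanh(c)                                  # hidden state
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(5):                                      # run 5 time steps
    h, c = lstm_step(rng.normal(size=n_in), h, c)
print(h.shape, np.all(np.abs(h) < 1.0))                 # (4,) True
```

The cell state $C_t$ is the "memory" channel: the forget gate decides how much of it survives each step, which is what lets the network carry trends across long windows.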

We trained on 2010–2017 data and tested on 2018–2022. Every 25 days, we ranked the assets by predicted return, bought the top two, and shorted the bottom two.

$$R_{long} = \frac{1}{2} \sum_{i \in Top2} r_i, \quad R_{short} = \frac{1}{2} \sum_{j \in Bottom2} r_j$$

$$R_{strategy} = 0.5 \times R_{long} - 0.5 \times R_{short}$$
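With hypothetical predictions and realized returns (the numbers below are made up for illustration), the ranking rule and the strategy formula work out like this:

```python
import numpy as np
import pandas as pd

# Hypothetical 25-day predictions and the returns actually realized
preds    = pd.Series({"SPY": 0.04, "TLT": -0.02, "SHY": 0.00, "GLD": 0.03, "DBO": -0.05})
realized = pd.Series({"SPY": 0.02, "TLT": -0.01, "SHY": 0.00, "GLD": 0.01, "DBO": -0.04})

ranks  = preds.rank(ascending=False)   # 1 = highest predicted return
longs  = ranks[ranks <= 2].index       # top two: SPY, GLD
shorts = ranks[ranks > 3].index        # bottom two: TLT, DBO

r_long  = realized[longs].mean()       # (0.02 + 0.01) / 2 = 0.015
r_short = realized[shorts].mean()      # (-0.01 - 0.04) / 2 = -0.025
r_strategy = 0.5 * r_long - 0.5 * r_short
print(r_strategy)                       # ~0.02: both legs pay off
```

The short leg enters with a minus sign, so the strategy profits when the bottom-ranked assets fall, which is exactly the hedge the backtest relies on.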

We compared this to an equal-weight benchmark. The backtest showed the LSTM managed risk by hedging, but we needed to look at the errors to understand why.

In [3]:
# ============================================================================
# STEP 2: LSTM
# ============================================================================

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

# Ensure reproducibility
tf.random.set_seed(42)
np.random.seed(42)

# Global Plot Settings
plt.rcParams['figure.figsize'] = (16, 6)
plt.rcParams['font.family'] = 'sans-serif'


# Helper: RSI Calculation
def calculate_rsi(series, period=14):
    delta = series.diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=period).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=period).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))

# Frac Diff Prices (Trend Preservation)
feat_prices = frac_diffs.copy()
feat_prices.columns = [f"{col}_frac" for col in feat_prices.columns]

#Volatility (Log-Scaled to normalize spikes)
feat_vol = np.log1p(log_returns.rolling(window=5).std().fillna(0) * 100)
feat_vol.columns = [f"{col}_vol" for col in feat_vol.columns]

# RSI (Momentum Indicator)
feat_rsi = pd.DataFrame()
for ticker in TICKERS:
    feat_rsi[f"{ticker}_rsi"] = calculate_rsi(prices[ticker], period=14)
feat_rsi = feat_rsi.fillna(50)

# Combine Features
features = pd.concat([feat_prices, feat_vol, feat_rsi], axis=1).dropna()

# TARGET: 25-Day Cumulative Log Return
horizon = 25
targets = pd.DataFrame()
for ticker in TICKERS:
    targets[ticker] = log_returns[ticker].rolling(window=horizon).sum().shift(-horizon)

targets = targets.dropna()

# Align Data
common_idx = features.index.intersection(targets.index)
features = features.loc[common_idx]
targets = targets.loc[common_idx]

# Split Train/Test (Cutoff: Jan 1, 2018)
test_start = pd.Timestamp('2018-01-01')
train_idx = features.index[features.index < test_start]
test_idx = features.index[features.index >= test_start]

# Validation Split (Last 20% of Train)
val_cut = int(0.8 * len(train_idx))
train_idx_final = train_idx[:val_cut]
val_idx = train_idx[val_cut:]

# ---------------------------
# 2. MODEL DEFINITION (WITH BATCH NORM)
# ---------------------------

def create_advanced_lstm(input_shape):
    model = keras.Sequential([
        # Layer 1: LSTM
        layers.LSTM(64, return_sequences=True, input_shape=input_shape),
        layers.BatchNormalization(), # Helps stabilize learning
        layers.Dropout(0.2),

        # Layer 2: LSTM
        layers.LSTM(32, return_sequences=False),
        layers.BatchNormalization(),
        layers.Dropout(0.2),

        # Dense Layers
        layers.Dense(16, activation='relu'),
        layers.Dense(1) # Linear output
    ])

    # Note: the EDA flagged Huber loss for fat tails; plain MSE is used here
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss='mse')
    return model

def prepare_multivariate_sequences(data, target_series, window=60):
    X, y = [], []
    if len(data) <= window: return np.array(X), np.array(y)
    for i in range(window, len(data)):
        X.append(data[i-window:i])
        y.append(target_series[i])
    return np.array(X), np.array(y)

# ---------------------------
# 3. TRAINING
# ---------------------------

window_size = 60
models = {}
predictions = {}
metrics_list = []

print(f"\n{'='*80}")
print("STARTING TRAINING & VISUALIZATION LOOP")
print(f"{'='*80}\n")

for ticker in TICKERS:
    print(f"Processing Asset: {ticker}...")

    # --- A. Data Prep ---
    # Select specific features for this ticker
    ticker_cols = [c for c in features.columns if ticker in c]
    feature_data = features[ticker_cols].values
    target_data = targets[ticker].values

    # Scaling (Fit on Train only)
    scaler = MinMaxScaler(feature_range=(-1, 1))

    # Slicing
    X_train_raw = feature_data[features.index.isin(train_idx_final)]
    y_train_raw = target_data[features.index.isin(train_idx_final)]

    X_val_raw = feature_data[features.index.isin(val_idx)]
    y_val_raw = target_data[features.index.isin(val_idx)]

    X_test_raw = feature_data[features.index.isin(test_idx)]
    y_test_raw = target_data[features.index.isin(test_idx)]

    # Transform
    X_train_s = scaler.fit_transform(X_train_raw)
    X_val_s = scaler.transform(X_val_raw)
    X_test_s = scaler.transform(X_test_raw)

    # Sequence Creation
    train_X, train_y = prepare_multivariate_sequences(X_train_s, y_train_raw, window_size)
    val_X, val_y = prepare_multivariate_sequences(X_val_s, y_val_raw, window_size)
    test_X, test_y = prepare_multivariate_sequences(X_test_s, y_test_raw, window_size)

    # --- B. Training ---
    model = create_advanced_lstm((train_X.shape[1], train_X.shape[2]))

    history = model.fit(
        train_X, train_y,
        validation_data=(val_X, val_y),
        epochs=50,
        batch_size=32,
        verbose=0,
        callbacks=[
            keras.callbacks.EarlyStopping(monitor='val_loss', patience=8, restore_best_weights=True)
        ]
    )

    # Store Model & Predictions
    models[ticker] = model
    test_pred = model.predict(test_X, verbose=0).flatten()

    # --- ADDED: METRICS CALCULATION (For Questions 2b & 2c) ---
    train_pred = model.predict(train_X, verbose=0).flatten()
    val_pred = model.predict(val_X, verbose=0).flatten()

    # Calculate RMSE
    rmse_train = math.sqrt(mean_squared_error(train_y, train_pred))
    rmse_val = math.sqrt(mean_squared_error(val_y, val_pred))
    rmse_test = math.sqrt(mean_squared_error(y_test_raw[window_size:], test_pred))

    metrics_list.append({
        'Asset': ticker,
        'Train RMSE': rmse_train,
        'Val RMSE': rmse_val,
        'Test RMSE': rmse_test
    })
    # -----------------------------------------------------------

    pred_dates = features.index[features.index.isin(test_idx)][window_size:]

    # Save for Backtest Later
    predictions[ticker] = {
        'test_pred': test_pred,
        'test_dates': pred_dates
    }


    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6))

    # Chart 1: Actual vs Predicted (Test Set)
    ax1.plot(pred_dates, y_test_raw[window_size:], label='Actual Return (25d)', color='grey', alpha=0.5, linewidth=1)
    ax1.plot(pred_dates, test_pred, label='LSTM Prediction', color='blue', linewidth=1.5)
    ax1.axhline(0, color='black', linestyle='--', linewidth=0.8)
    ax1.set_title(f"{ticker}: Out-of-Sample Performance (Test Set)", fontweight='bold')
    ax1.set_ylabel("Cumulative Log Return")
    ax1.legend(loc='upper left')
    ax1.grid(True, alpha=0.3)

    # Chart 2: Loss Function (Train vs Val)
    ax2.plot(history.history['loss'], label='Training Loss', color='navy')
    ax2.plot(history.history['val_loss'], label='Validation Loss', color='orange')
    ax2.set_title(f"{ticker}: Model Convergence (Loss Curve)", fontweight='bold')
    ax2.set_xlabel("Epochs")
    ax2.set_ylabel("MSE Loss")
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

# --- ADDED: PRINT METRICS REPORT (Satisfies 2b & 2c) ---
print("\n" + "-"*50)
print("MODEL PERFORMANCE REPORT (RMSE)")
print("="*50)
print(pd.DataFrame(metrics_list).round(5))
print("-" * 50)


# ---------------------------
# 4. TRADING STRATEGY (BACKTEST)
# ---------------------------
# (Keep the same logic as before for the portfolio construction)

print(f"\n{'='*80}")
print("CALCULATING STRATEGY RETURNS...")
print(f"{'='*80}\n")

# Reconstruct Predictions DataFrame
pred_df = pd.DataFrame()
for ticker in TICKERS:
    p_series = pd.Series(predictions[ticker]['test_pred'], index=predictions[ticker]['test_dates'])
    pred_df[ticker] = p_series
pred_df = pred_df.dropna()

portfolio_log_returns = []
portfolio_dates = []
rebalance_freq = 25

for i in range(0, len(pred_df), rebalance_freq):
    current_date = pred_df.index[i]
    if i + rebalance_freq >= len(pred_df): break

    # Rank by predicted return: long the top 2, short the bottom 2 (ranks 4 and 5)
    ranks = pred_df.iloc[i].rank(ascending=False)
    longs = ranks[ranks <= 2].index.tolist()
    shorts = ranks[ranks > 3].index.tolist()

    # Real Returns
    start_loc = log_returns.index.get_loc(current_date)
    period_rets = log_returns.iloc[start_loc : start_loc+rebalance_freq]

    # Calculate Strategy Return
    l_ret = period_rets[longs].mean(axis=1) if longs else 0
    s_ret = period_rets[shorts].mean(axis=1) if shorts else 0
    strat_ret = 0.5 * l_ret - 0.5 * s_ret # Long - Short

    portfolio_log_returns.extend(strat_ret.values)
    portfolio_dates.extend(strat_ret.index)

# Final Comparison Plot
port_series = pd.Series(portfolio_log_returns, index=portfolio_dates)
avail_assets = pred_df.columns.tolist()
bench_ret = log_returns.loc[port_series.index, avail_assets].mean(axis=1)

# Cumulative (Simple)
cum_strat = np.exp(port_series.cumsum()) - 1
cum_bench = np.exp(bench_ret.cumsum()) - 1

plt.figure(figsize=(14, 7))
plt.plot(cum_strat.index, cum_strat, label='LSTM Long/Short Strategy', color='blue', linewidth=2)
plt.plot(cum_bench.index, cum_bench, label='Equal Weight Benchmark', color='grey', linestyle='--', linewidth=2)
plt.title("Final Backtest: Strategy vs Benchmark", fontweight='bold', fontsize=14)
plt.ylabel("Cumulative Return")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# --- ADDED: DETAILED METRICS FOR QUESTION 2e ---
total_strat = cum_strat.iloc[-1]
total_bench = cum_bench.iloc[-1]
ann_vol_strat = port_series.std() * np.sqrt(252)
sharpe_strat = (port_series.mean() * 252) / (port_series.std() * np.sqrt(252))

print("\n" + "-"*50)
print("FINAL STRATEGY vs BENCHMARK REPORT (Step 2e)")
print("-"*50)
print(f"Strategy Total Return:   {total_strat:.2%}")
print(f"Benchmark Total Return:  {total_bench:.2%}")
print(f"Strategy Annual Vol:     {ann_vol_strat:.2%}")
print(f"Strategy Sharpe Ratio:   {sharpe_strat:.2f}")
print("="*50)
================================================================================
STARTING TRAINING & VISUALIZATION LOOP
================================================================================

Processing Asset: SPY...
Processing Asset: TLT...
Processing Asset: SHY...
Processing Asset: GLD...
Processing Asset: DBO...
--------------------------------------------------
MODEL PERFORMANCE REPORT (RMSE)
==================================================
  Asset  Train RMSE  Val RMSE  Test RMSE
0   SPY     0.07878   0.03553    0.15012
1   TLT     0.08032   0.08631    0.12396
2   SHY     0.02135   0.01969    0.06880
3   GLD     0.06397   0.05418    0.04844
4   DBO     0.10213   0.06058    0.11714
--------------------------------------------------

================================================================================
CALCULATING STRATEGY RETURNS...
================================================================================

--------------------------------------------------
FINAL STRATEGY vs BENCHMARK REPORT (Step 2e)
--------------------------------------------------
Strategy Total Return:   -4.96%
Benchmark Total Return:  19.38%
Strategy Annual Vol:     9.16%
Strategy Sharpe Ratio:   -0.12
==================================================

Results¶

The learning curves looked promising at first. For GLD and SPY, the training loss dropped fast and leveled out. But the test predictions tell a different story: the models choked on big shocks. In the DBO chart, crude oil collapses in 2020, but the model’s prediction (blue line) barely dips. It completely missed the scale of the crash. Same thing with SPY: the model reacts to volatility, but it’s too little, too late.

The numbers back this up. SPY had the worst out-of-sample error (Test RMSE = 0.150), with crude oil close behind (0.117), because the models couldn't handle the extreme moves. Strangely, the model also failed on SHY (Cash), the safest asset. Actual returns were flat, but the model predicted wild swings, basically hallucinating volatility. That gave us a high error rate for what should have been an easy win.

This destroyed the trading performance. The strategy started strong during the 2020 crash, but got hammered in 2021 and 2022, ending down -4.96%. Meanwhile, the boring Equal Weight Benchmark was up 19.38%. A negative Sharpe ratio (-0.12) means our active rebalancing actually destroyed value compared to doing nothing.

Neural nets can approximate complex patterns, but they overfit and fail when the data shifts out-of-sample (Gu et al., 2020). Fischer and Krauss (2018) saw the same thing: LSTMs might predict daily moves, but transaction costs and regime shifts usually eat the profits.

Step 3¶

In step three, we tried a Multi-Output LSTM. The idea was to use a single neural network for all five assets. In theory, this helps the model learn correlations—like how stocks dropping might send gold higher.

We combined all features and targets. The target $Y_t$ became a vector for all five ETFs:

$$Y_t = \left[y_{SPY, t}, y_{TLT, t}, y_{SHY, t}, y_{GLD, t}, y_{DBO, t}\right]$$

Using the same data split, we fed a 60-day window into an encoder-decoder LSTM stack (64 and 32 units, with Batch Normalization and Dropout) that outputs all 5 predictions at once. We minimized Mean Squared Error (MSE) across the board:

$$Loss = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{5} (y_{i,j} - \hat{y}_{i,j})^2$$

We trained for up to 60 epochs with early stopping and ran the exact same trading strategy.
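One detail worth knowing: Keras's built-in `'mse'` also averages over the 5 output neurons, so it computes $\frac{1}{5}$ of the loss written above. The constant rescaling doesn't change the optimum, as a quick NumPy check confirms:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
y_true = rng.normal(0, 0.05, (N, 5))   # N samples x 5 assets
y_pred = rng.normal(0, 0.05, (N, 5))

# Loss as written above: average over samples, summed over the 5 outputs
loss_formula = np.mean(np.sum((y_true - y_pred) ** 2, axis=1))

# Keras-style 'mse' additionally averages over the output axis
loss_keras = np.mean((y_true - y_pred) ** 2)

print(np.isclose(loss_formula, 5 * loss_keras))   # True: same optimum, scaled by 5
```

What does matter is that all 5 targets share one scalar loss, so a hard-to-predict asset (like DBO) can drag the shared weights around during training.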

In [ ]:
# ============================================================================
# STEP 3: MULTI-OUTPUT LSTM
# ============================================================================

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

# Ensure reproducibility
tf.random.set_seed(42)
np.random.seed(42)

# Global Plot Settings
plt.rcParams['figure.figsize'] = (16, 6)
plt.rcParams['font.family'] = 'sans-serif'
plt.style.use('seaborn-v0_8-whitegrid')

# ---------------------------
# 1. DATA PREPARATION (GLOBAL SCALING)
# ---------------------------
print("Preparing High-Performance Data Structures...")

# We use the 'features' and 'targets' defined in Step 2.
# CRITICAL FIX: Scale Features and Targets INDEPENDENTLY for better convergence.

# Split Indices
test_start = pd.Timestamp('2018-01-01')
train_idx = features.index[features.index < test_start]
test_idx = features.index[features.index >= test_start]

# Validation Split
val_cut = int(0.8 * len(train_idx))
train_idx_final = train_idx[:val_cut]
val_idx = train_idx[val_cut:]

# Global Scalers
scaler_X = MinMaxScaler((-1, 1))
scaler_y = MinMaxScaler((-1, 1))

# Fit Scalers on TRAIN set only
X_train_raw = features.loc[train_idx_final].values
y_train_raw = targets.loc[train_idx_final].values

scaler_X.fit(X_train_raw)
scaler_y.fit(y_train_raw)

# Transform All Data
X_scaled = scaler_X.transform(features.values)
y_scaled = scaler_y.transform(targets.values)

def prepare_multi_output_sequences(X_data, y_data, window=60):
    X, y = [], []
    if len(X_data) <= window: return np.array(X), np.array(y)
    for i in range(window, len(X_data)):
        X.append(X_data[i-window:i])
        y.append(y_data[i])
    return np.array(X), np.array(y)

# Create Sequences
window_size = 60

# Train
mask_train = features.index.isin(train_idx_final)
X_tr, y_tr = prepare_multi_output_sequences(X_scaled[mask_train], y_scaled[mask_train], window_size)

# Val
mask_val = features.index.isin(val_idx)
X_val, y_val = prepare_multi_output_sequences(X_scaled[mask_val], y_scaled[mask_val], window_size)

# Test
mask_test = features.index.isin(test_idx)
X_te, y_te = prepare_multi_output_sequences(X_scaled[mask_test], y_scaled[mask_test], window_size)

print(f"Input Shape: {X_tr.shape} (Samples, Window, Features)")
print(f"Output Shape: {y_tr.shape} (Samples, 5 Assets)")

# ---------------------------
# 2. ENCODER-DECODER ARCHITECTURE (The "Brain" Upgrade)
# ---------------------------
# This architecture forces the model to compress market info into a latent state
# before predicting, which captures cross-asset correlations better.

def create_encoder_decoder_model(input_shape, num_outputs=5):
    inputs = layers.Input(shape=input_shape)

    # --- ENCODER (Read the Market) ---
    # Process the sequence
    x = layers.LSTM(64, return_sequences=True)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)

    # Compress to Context Vector (The "Gist" of the market)
    x = layers.LSTM(32, return_sequences=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)

    # --- BOTTLENECK (Latent Space) ---
    x = layers.RepeatVector(num_outputs)(x) # Prepare for decoding per asset

    # --- DECODER (Predict the Future) ---
    x = layers.LSTM(32, return_sequences=True)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)

    # Flatten for final dense layer
    x = layers.Flatten()(x)

    # Interpretation
    x = layers.Dense(64, activation='relu')(x)

    # Output Layer (5 neurons, one for each asset)
    outputs = layers.Dense(num_outputs)(x)

    model = keras.Model(inputs=inputs, outputs=outputs)

    # We use a slightly lower learning rate for stability in multi-output
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0005), loss='mse')
    return model

# ---------------------------
# 3. TRAINING & EVALUATION
# ---------------------------

print(f"\n{'='*80}")
print("TRAINING ENCODER-DECODER MULTI-OUTPUT MODEL")
print(f"{'='*80}\n")

model_mo = create_encoder_decoder_model((X_tr.shape[1], X_tr.shape[2]), num_outputs=5)

# Train with Early Stopping
history_mo = model_mo.fit(
    X_tr, y_tr,
    validation_data=(X_val, y_val),
    epochs=60, # More epochs because learning rate is lower
    batch_size=32,
    verbose=0,
    callbacks=[
        keras.callbacks.EarlyStopping(monitor='val_loss', patience=12, restore_best_weights=True)
    ]
)

# Generate Predictions (Scaled)
train_pred_scaled = model_mo.predict(X_tr, verbose=0)
val_pred_scaled = model_mo.predict(X_val, verbose=0)
test_pred_scaled = model_mo.predict(X_te, verbose=0)

# Inverse Transform (Back to Real Returns)
train_pred = scaler_y.inverse_transform(train_pred_scaled)
val_pred = scaler_y.inverse_transform(val_pred_scaled)
test_pred = scaler_y.inverse_transform(test_pred_scaled)

# Get Actuals (Inverse Scaled)
y_tr_real = scaler_y.inverse_transform(y_tr)
y_val_real = scaler_y.inverse_transform(y_val)
y_te_real = scaler_y.inverse_transform(y_te)

# ---------------------------
# 4. DASHBOARD & METRICS
# ---------------------------

mo_metrics = []
pred_dates = features.index[mask_test][window_size:]

# Store predictions for strategy
mo_predictions = pd.DataFrame(test_pred, index=pred_dates, columns=TICKERS)

for i, ticker in enumerate(TICKERS):
    # Metrics
    rmse_tr = math.sqrt(mean_squared_error(y_tr_real[:, i], train_pred[:, i]))
    rmse_val = math.sqrt(mean_squared_error(y_val_real[:, i], val_pred[:, i]))
    rmse_te = math.sqrt(mean_squared_error(y_te_real[:, i], test_pred[:, i]))
    mae_te = mean_absolute_error(y_te_real[:, i], test_pred[:, i])

    mo_metrics.append({
        'Asset': ticker,
        'Train RMSE': rmse_tr,
        'Val RMSE': rmse_val,
        'Test RMSE': rmse_te,
        'Test MAE': mae_te
    })

    # --- VISUALIZATION DASHBOARD ---
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6))

    # Plot 1: Fit
    ax1.plot(pred_dates, y_te_real[:, i], label='Actual', color='grey', alpha=0.6)
    ax1.plot(pred_dates, test_pred[:, i], label='Multi-Output Prediction', color='purple', linewidth=1.5)
    ax1.axhline(0, color='black', linestyle='--', linewidth=0.8)
    ax1.set_title(f"{ticker}: Multi-Output Model Fit", fontweight='bold')
    ax1.set_ylabel("Cumulative Log Return")
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # Plot 2: Loss (Global)
    ax2.plot(history_mo.history['loss'], label='Global Train Loss', color='navy')
    ax2.plot(history_mo.history['val_loss'], label='Global Val Loss', color='orange')
    ax2.set_title(f"Global Model Convergence (Loss Curve)", fontweight='bold')
    ax2.set_xlabel("Epochs")
    ax2.set_ylabel("MSE Loss")
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

print("\n" + "="*50)
print("TABLE 3: MULTI-OUTPUT MODEL METRICS")
print("="*50)
print(pd.DataFrame(mo_metrics).round(5))
print("-" * 50)

# ---------------------------
# 5. STRATEGY & COMPARISON (3c & 3d)
# ---------------------------

print(f"\n{'='*80}")
print("MULTI-OUTPUT STRATEGY BACKTEST (3c & 3d)")
print(f"{'='*80}\n")

mo_portfolio_returns = []
mo_portfolio_dates = []
rebalance_freq = 25

# Reuse indices from pred_dates logic
aligned_preds = mo_predictions.dropna()

for i in range(0, len(aligned_preds), rebalance_freq):
    current_date = aligned_preds.index[i]
    if i + rebalance_freq >= len(aligned_preds): break

    # 1. Rank based on Multi-Output Predictions
    row = aligned_preds.iloc[i]
    ranks = row.rank(ascending=False)

    longs = ranks[ranks <= 2].index.tolist()   # long the top-2 predicted assets
    shorts = ranks[ranks > 3].index.tolist()   # short the bottom-2 (rank 3 stays flat)

    # 2. Real Returns
    start_loc = log_returns.index.get_loc(current_date)
    period_rets = log_returns.iloc[start_loc : start_loc+rebalance_freq]

    l_ret = period_rets[longs].mean(axis=1) if longs else 0
    s_ret = period_rets[shorts].mean(axis=1) if shorts else 0
    strat_ret = 0.5 * l_ret - 0.5 * s_ret  # 50/50 dollar-neutral long/short book

    mo_portfolio_returns.extend(strat_ret.values)
    mo_portfolio_dates.extend(strat_ret.index)

# Create Series
mo_series = pd.Series(mo_portfolio_returns, index=mo_portfolio_dates)
bench_series_mo = log_returns.loc[mo_series.index, TICKERS].mean(axis=1)

# Cumulative Returns
cum_mo_strat = np.exp(mo_series.cumsum()) - 1
cum_bench_mo = np.exp(bench_series_mo.cumsum()) - 1
# Note: 'cum_strat' from Step 2 should be available in memory.
# If running fresh, ensure Step 2 variable 'cum_strat' is saved or recalculated.
cum_step2_strat = cum_strat # Re-using variable from previous cell

# Comparison Plot
plt.figure(figsize=(16, 8))
plt.plot(cum_mo_strat.index, cum_mo_strat, label='Step 3: Multi-Output Strategy (Enc-Dec)', color='purple', linewidth=2.5)
plt.plot(cum_step2_strat.index, cum_step2_strat, label='Step 2: Single-Output Strategy', color='blue', linewidth=1.5, alpha=0.7)
plt.plot(cum_bench_mo.index, cum_bench_mo, label='Benchmark (Equal Weight)', color='grey', linestyle='--', linewidth=2)

plt.title("Step 3d: Model Comparison (Multi-Output vs Single-Output vs Benchmark)", fontweight='bold', fontsize=16)
plt.ylabel("Cumulative Return")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Final Metrics
total_mo = cum_mo_strat.iloc[-1]
total_s2 = cum_step2_strat.iloc[-1]
total_bn = cum_bench_mo.iloc[-1]
vol_mo = mo_series.std() * np.sqrt(252)
sharpe_mo = (mo_series.mean() * 252) / (mo_series.std() * np.sqrt(252))

print("\n" + "="*60)
print("TABLE 4: FINAL COMPARATIVE REPORT (Step 3d)")
print("="*60)
print(f"Multi-Output Strategy Return:  {total_mo:.2%}")
print(f"Single-Output Strategy Return: {total_s2:.2%}")
print(f"Benchmark Return:              {total_bn:.2%}")
print("-" * 60)
print(f"MO Strategy Volatility:        {vol_mo:.2%}")
print(f"MO Strategy Sharpe Ratio:      {sharpe_mo:.2f}")
print("="*60)
Preparing High-Performance Data Structures...
Input Shape: (1240, 60, 15) (Samples, Window, Features)
Output Shape: (1240, 5) (Samples, 5 Assets)

================================================================================
TRAINING ENCODER-DECODER MULTI-OUTPUT MODEL
================================================================================

[Figure: per-asset dashboards (Multi-Output model fit vs. actuals, plus global loss curves) for SPY, TLT, SHY, GLD, DBO]
==================================================
TABLE 3: MULTI-OUTPUT MODEL METRICS
==================================================
  Asset  Train RMSE  Val RMSE  Test RMSE  Test MAE
0   SPY     0.03148   0.02847    0.06085   0.04211
1   TLT     0.03818   0.02528    0.05296   0.04027
2   SHY     0.00205   0.00238    0.00547   0.00380
3   GLD     0.05357   0.03726    0.04692   0.03783
4   DBO     0.10478   0.11528    0.14396   0.11855
--------------------------------------------------

================================================================================
MULTI-OUTPUT STRATEGY BACKTEST (3c & 3d)
================================================================================

[Figure: cumulative return comparison of the Multi-Output strategy, the Single-Output strategy, and the equal-weight benchmark]
============================================================
TABLE 4: FINAL COMPARATIVE REPORT (Step 3d)
============================================================
Multi-Output Strategy Return:  -12.36%
Single-Output Strategy Return: 22.32%
Benchmark Return:              19.38%
------------------------------------------------------------
MO Strategy Volatility:        10.74%
MO Strategy Sharpe Ratio:      -0.27
============================================================

Results¶

Results interpretation

Adding complexity didn't help; it actually made things worse. The SPY and TLT fits show predictions that are far too conservative: they hug the zero line and ignore real volatility. The same dampening shows up in DBO, where the model missed the 2020 oil crash entirely, producing the highest error of any model we trained (RMSE = 0.14396).

The loss curves explain it. Training loss went down, but validation loss was flat from the start. The model basically gave up on learning patterns and settled for a "safe" average.
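One way to see why flat validation loss goes hand in hand with near-zero forecasts: under MSE, if the model can't extract signal, the best it can do is output the target's unconditional mean. A minimal sketch on synthetic returns (not our data; the drift and scale parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Noisy "returns" with a tiny positive drift, like a hard-to-predict asset.
y = rng.normal(loc=0.001, scale=0.05, size=10_000)

def mse(const_pred, targets):
    return np.mean((const_pred - targets) ** 2)

# Sweep constant predictions; the minimizer lands on the sample mean.
grid = np.linspace(-0.02, 0.02, 401)
best_const = grid[np.argmin([mse(c, y) for c in grid])]

assert abs(best_const - y.mean()) < 1e-3  # MSE-optimal constant ≈ mean
```

In other words, the dampened, zero-hugging curves in the SPY and TLT panels are exactly what a model converges to when the signal it finds is weaker than the noise.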

This killed the returns. The Multi-Output model was the worst performer, returning -12.36%. Ironically, the simpler Single-Output strategy from Step 2 did best (22.32%), beating both the complex model and the benchmark. Forcing one network to learn everything at once just introduced noise.

This matches the Efficient Market Hypothesis: prices adjust too quickly to be predicted easily (Fama). Gu, Kelly, and Xiu noted the same thing: while ML helps in some settings, out-of-sample return predictability is often limited and unstable.

Step 4¶

So, why did the single-output (Step 2) and multi-output (Step 3) models behave so differently?

  • Predictability Implications: The single-output models were "noisier" because they were trained in isolation. Each one focused only on minimizing error for its specific asset, so it could overfit to that asset's specific noise. The multi-output model did the opposite: it reverted to the mean. This is a known failure mode of Multitask Learning (MTL): shared representations help only when the tasks are related enough, and a single very noisy task can drag the others down (Caruana 43). The extreme volatility in Crude Oil likely dominated the shared gradients, pulling capacity away from safer assets like SHY and pushing the network to play it safe with near-zero predictions for everyone.

  • Backtesting Performance: The single-output strategy returned 22.32%, while the multi-output strategy returned -12.36%. More complexity clearly did not mean more money. The single-output models were imperfect, but they caught enough directional signal to profit; the multi-output model smoothed everything out so much it missed the trends. As Fischer and Krauss (2018) noted, LSTMs can work well in this domain, but added complexity doesn't guarantee results.

  • Is the Same Information Being Captured? No. Single-Output models captured idiosyncratic risk—the specific behavior of that one ticker. The Multi-Output model tried to capture systemic risk and correlations. But for this specific dataset, the idiosyncratic dynamics were more profitable. We likely saw "negative transfer," where learning one task hurts another, because the shared weights couldn't handle safe assets and risky assets at the same time.
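The gradient-dominance story above can be sketched with a toy shared-trunk model: two tasks read the same features through one shared weight vector, and the noisier task's MSE gradient swamps the calm task's. Everything here is synthetic and exaggerated for illustration; the `y_calm`/`y_noisy` names and scales are assumptions, not our market data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 8
X = rng.normal(size=(n, d))                                # shared input features
y_calm = 0.01 * X[:, 0] + rng.normal(scale=0.01, size=n)   # low-vol task (SHY-like)
y_noisy = 0.01 * X[:, 0] + rng.normal(scale=1.00, size=n)  # high-vol task (DBO-like)

w = np.zeros(d)             # one shared weight vector feeding both predictions
pred = X @ w
g_calm = 2 * X.T @ (pred - y_calm) / n    # MSE gradient from the calm task
g_noisy = 2 * X.T @ (pred - y_noisy) / n  # MSE gradient from the noisy task

# The summed gradient (what a shared trunk actually receives) points almost
# entirely in the noisy task's direction: its norm dwarfs the calm task's.
ratio = np.linalg.norm(g_noisy) / np.linalg.norm(g_calm)
assert ratio > 3  # the volatile task dominates the shared update
```

A shared LSTM trunk is in the same position: with a plain summed MSE, the update direction is set by whichever asset has the largest errors, which is one plausible mechanism behind the negative transfer we observed.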

Conclusion¶

This project taught us a hard lesson about complexity. Making the model fancier doesn't automatically make it a better investor. We built both single and multi-output LSTMs, and the results were totally different.

The simpler, individual models (Step 2) actually worked better. Because they focused on one asset at a time, they picked up enough signal to return 22.32%, beating the benchmark. The advanced multi-output model (Step 3) failed, returning -12.36%. By trying to minimize error across five different assets at once, it averaged out the signals and missed the opportunities.

In the end, while LSTMs can find non-linear patterns, our results show it's more profitable to model the unique dynamics of each asset rather than trying to model the whole system at once.

References¶

Basu, Devraj, and Joëlle Miffre. “Capturing the Risk Premium of Commodity Futures: The Role of Hedging Pressure.” Journal of Banking & Finance, vol. 37, no. 7, 2013, pp. 2652–64.

Campbell, John Y., Andrew W. Lo, and A. Craig MacKinlay. The Econometrics of Financial Markets. Princeton University Press, 1997.

Caruana, Rich. “Multitask Learning.” Machine Learning, vol. 28, 1997, pp. 41–75.

Fama, Eugene F. “Efficient Capital Markets: A Review of Theory and Empirical Work.” The Journal of Finance, vol. 25, no. 2, 1970, pp. 383–417.

Fischer, Thomas, and Christopher Krauss. “Deep Learning with Long Short-Term Memory Networks for Financial Market Predictions.” European Journal of Operational Research, vol. 270, no. 2, 2018, pp. 654–69.

Gu, Shihao, Bryan Kelly, and Dacheng Xiu. “Empirical Asset Pricing via Machine Learning.” Review of Financial Studies, vol. 33, no. 5, 2020, pp. 2223–73.

Harvey, Campbell R., and Yan Liu. “Lucky Factors.” Journal of Financial Economics, vol. 141, no. 2, 2021, pp. 413–35.

Lopez de Prado, Marcos. Advances in Financial Machine Learning. John Wiley & Sons, 2018.

Zhang, Aston, et al. Dive into Deep Learning. Cambridge University Press, 2023.