Introduction¶
This assignment examines time series prediction in financial markets with deep learning models, focusing on information leakage and backtest overfitting. Financial time series are noisy, non-stationary, and highly sensitive to modeling assumptions, so the predictive performance observed in a backtest can be misleading if the experimental design is not carefully controlled.
We focus on Bitcoin (BTC) and use its daily price data from January 1, 2022 to December 1, 2025. From this data, we construct forecasting targets such as returns and volatility. We begin by deliberately introducing label leakage into the modeling process to show how poorly designed training and validation splits can produce unrealistically strong backtesting results. Using this intentionally flawed setup, we then train and compare three deep learning models that are commonly applied in financial forecasting: a Multilayer Perceptron (MLP), a Long Short-Term Memory (LSTM) network, and a Convolutional Neural Network (CNN) based on Gramian Angular Fields (GAF).
We start with a straightforward approach, using a single split between the training and test sets, and gradually progress towards more realistic walk-forward backtesting procedures. This lets us observe how model performance changes as the validation environment moves closer to a realistic trading environment. In the later parts of the analysis, we revise the experimental setup and re-evaluate the models under these stricter conditions, which allows us to separate genuine predictive power from overfitting caused by incorrect validation settings.
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.tsaplots import plot_acf
from scipy import stats
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Conv2D, MaxPooling2D, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import mean_squared_error, r2_score
Step 1¶
Before fitting any models we start with descriptive statistics. As illustrated in Figure 1, the raw price history is highly non-stationary, displaying distinct trends that complicate direct prediction. To address this, the data was converted into log returns (Figure 3). This transformation detrends the series and centers the mean near zero, creating a stable input for the neural networks.
Furthermore, a comparison of the Autocorrelation Function (ACF) confirms the necessity of this step. While the raw prices exhibit persistent serial correlation—a hallmark of non-stationarity—the ACF of the log returns decays rapidly, confirming that the transformed series is statistically sound for training.
#step 1
np.random.seed(42)
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (14, 5)
# ---------------------------------------------------------
# PART (a): DATA ACQUISITION & DESCRIPTION
# ---------------------------------------------------------
ticker = "BTC-USD"
start_date = "2022-01-01"
end_date = "2025-12-01"
data = yf.download(
ticker,
start=start_date,
end=end_date,
auto_adjust=True,
progress=False
)
prices = data["Close"].squeeze().dropna()
print(f"{ticker}: {len(prices)} daily observations")
plt.plot(prices, color="navy", linewidth=1.5)
plt.title("Figure 1: Bitcoin Daily Prices (USD)", fontweight="bold")
plt.ylabel("Price")
plt.xlabel("Date")
plt.tight_layout()
plt.show()
plot_acf(prices, lags=30)
plt.title("Figure 2: ACF of Bitcoin Prices")
plt.tight_layout()
plt.show()
print("\nTable 1. Statistical Summary – Price Levels")
print("-" * 50)
print(f"Mean: {prices.mean():.2f}")
print(f"Std Dev: {prices.std():.2f}")
print(f"Skewness: {stats.skew(prices):.2f}")
print(f"Kurtosis: {stats.kurtosis(prices):.2f}")
log_returns = np.log(prices / prices.shift(1)).dropna()
plt.plot(log_returns, color="orange", linewidth=1)
plt.title("Figure 3: Bitcoin Daily Log Returns", fontweight="bold")
plt.ylabel("Log Return")
plt.tight_layout()
plt.show()
plot_acf(log_returns, lags=30)
plt.title("Figure 4: ACF of Bitcoin Log Returns")
plt.tight_layout()
plt.show()
print("\nTable 2. Statistical Summary – Log Returns")
print("-" * 50)
print(f"Mean: {log_returns.mean():.5f}")
print(f"Std Dev: {log_returns.std():.5f}")
print(f"Skewness: {stats.skew(log_returns):.2f}")
print(f"Kurtosis: {stats.kurtosis(log_returns):.2f}")
BTC-USD: 1430 daily observations
Table 1. Statistical Summary – Price Levels -------------------------------------------------- Mean: 55462.46 Std Dev: 32323.60 Skewness: 0.57 Kurtosis: -1.07
Table 2. Statistical Summary – Log Returns -------------------------------------------------- Mean: 0.00045 Std Dev: 0.02711 Skewness: -0.16 Kurtosis: 4.45
The plot above shows the closing prices of BTC-USD. The price series is strongly non-stationary, with a noticeable uptrend and highly volatile fluctuations. The ACF plot reveals a slow decay in the autocorrelation values, indicating that prices depend heavily on their past levels. The statistical summary in Table 1 shows extreme values, skewness, and a non-Gaussian distribution, which justifies transforming prices into returns before fitting the models.
In Figure 3, the log returns of Bitcoin fluctuate around zero and display strong volatility clustering rather than persistent trends. Compared to the price series, the log returns are far more stable. The ACF plot in Figure 4 shows that the autocorrelations drop to almost zero rapidly, indicating weak linear dependence.
#Part B-- Converting data to Leaky Data
#Build Leaky Label
HORIZON = 5 # 5-day forward average return
y = (
    log_returns
    .shift(-HORIZON)            # align r_{t+HORIZON} at index t
    .rolling(window=HORIZON)
    .mean()                     # mean of r_{t+1}, ..., r_{t+HORIZON}
)
#Build lagged feature
N_LAGS = 20
X = pd.concat(
[log_returns.shift(i) for i in range(N_LAGS)],
axis=1
)
X.columns = [f"lag_{i}" for i in range(N_LAGS)]
data = pd.concat([X, y.rename("target")], axis=1).dropna()
X = data.drop(columns="target")
y = data["target"]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # <-- INFORMATION LEAKAGE
#Train Test Split
split = int(0.7 * len(X_scaled))
X_train = X_scaled[:split]
X_test = X_scaled[split:]
y_train = y.values[:split]
y_test = y.values[split:]
#Data Summary
print("Dataset Summary")
print("-" * 40)
print("Total observations:", len(X_scaled))
print("Train size:", len(X_train))
print("Test size:", len(X_test))
Dataset Summary ---------------------------------------- Total observations: 1409 Train size: 986 Test size: 423
Let $P_t$ denote the daily closing price of Bitcoin at time $t$.
The log return series is computed as:
$$ r_t = \ln\left(\frac{P_t}{P_{t-1}}\right), $$
which removes trends in price levels and yields a series that is closer to stationarity, making it suitable for predictive modeling.
The prediction target is deliberately defined as the average of future log returns over a fixed horizon $H = 5$: $ y_t = \frac{1}{H} \sum_{i=1}^{H} r_{t+i}. $
This forward-looking label construction introduces overlap across adjacent observations, causing future return information to be shared between neighboring samples and thereby creating information leakage.
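A tiny toy example (assumed numbers, not BTC data) makes the overlap explicit: adjacent labels share $H-1$ of their $H$ future returns.

```python
import numpy as np
import pandas as pd

# Toy daily returns 1..12 so the overlapping windows are easy to read off.
H = 5
r = pd.Series(np.arange(1.0, 13.0), name="r")

# y_t = mean(r_{t+1}, ..., r_{t+H}); shift(-H) then rolling(H) realizes this
y = r.shift(-H).rolling(H).mean()

# Labels at adjacent times average windows that overlap in H-1 = 4 returns:
print(y.iloc[4], "=", r.iloc[5:10].mean())   # mean of the 5 returns after t=4
print(y.iloc[5], "=", r.iloc[6:11].mean())   # shares 4 of those 5 returns
```

Because neighboring samples embed almost the same future information, any split that places them on opposite sides of a train/test boundary leaks the label.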
The feature set consists of lagged log returns over the previous 20 trading days:
$ \mathbf{X}_t = \left( r_t, r_{t-1}, \dots, r_{t-19} \right). $
All variables are normalized over the entire dataset before splitting into train and test sets, so the normalization parameters are computed using both training and test observations, which adds a further source of information leakage. The split itself is a single chronological cut, with no purging of the overlapping labels that straddle the boundary.
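A minimal sketch, assuming synthetic data with a volatility regime shift, of why fitting the scaler on the full sample leaks future information:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature with a regime shift: a calm "training era" followed by a
# high-volatility "test era", mimicking a change in market conditions.
rng = np.random.default_rng(1)
calm = rng.normal(0.0, 1.0, size=(700, 1))
wild = rng.normal(0.0, 4.0, size=(300, 1))
X = np.vstack([calm, wild])
split = 700

leaky = StandardScaler().fit(X)           # sees future volatility
causal = StandardScaler().fit(X[:split])  # sees training data only
print("global sigma:    ", round(float(leaky.scale_[0]), 3))
print("train-only sigma:", round(float(causal.scale_[0]), 3))
# The leaky scaler's sigma is inflated by the future regime, so training
# inputs are implicitly rescaled using information unavailable at train time.
```

The gap between the two sigmas is exactly the statistical information the model should not have at training time.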
#Part C-- Building model and prediction
#MLP
def build_mlp(input_dim):
model = Sequential([
Dense(128, activation="relu", input_shape=(input_dim,)),
Dense(64, activation="relu"),
Dense(1)
])
model.compile(optimizer=Adam(0.001), loss="mse")
return model
#LSTM
def build_lstm(timesteps):
model = Sequential([
LSTM(64, input_shape=(timesteps, 1)),
Dense(32, activation="relu"),
Dense(1)
])
model.compile(optimizer=Adam(0.001), loss="mse")
return model
#CNN with GAF
def gramian_angular_field(ts):
ts = (ts - ts.min()) / (ts.max() - ts.min())
ts = np.clip(ts, 0, 1)
phi = np.arccos(ts)
return np.cos(phi[:, None] + phi[None, :])
def build_gaf_images(X):
images = [gramian_angular_field(row) for row in X]
return np.array(images)[..., np.newaxis]
def build_cnn_gaf(input_shape):
model = Sequential([
Conv2D(32, (3,3), activation="relu", input_shape=input_shape),
MaxPooling2D((2,2)),
Conv2D(64, (3,3), activation="relu"),
Flatten(),
Dense(64, activation="relu"),
Dense(1)
])
model.compile(optimizer=Adam(0.001), loss="mse")
return model
# LSTM reshaping
X_train_lstm = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test_lstm = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)
# CNN-GAF transformation
X_train_gaf = build_gaf_images(X_train)
X_test_gaf = build_gaf_images(X_test)
es = EarlyStopping(patience=10, restore_best_weights=True)
# Initialize models
mlp = build_mlp(X_train.shape[1])
lstm = build_lstm(X_train_lstm.shape[1])
cnn = build_cnn_gaf(X_train_gaf.shape[1:])
# Train
history_mlp = mlp.fit(
X_train, y_train,
validation_split=0.2,
epochs=100,
batch_size=32,
callbacks=[es],
verbose=0
)
history_lstm = lstm.fit(
X_train_lstm, y_train,
validation_split=0.2,
epochs=100,
batch_size=32,
callbacks=[es],
verbose=0
)
history_cnn = cnn.fit(
X_train_gaf, y_train,
validation_split=0.2,
epochs=50,
batch_size=32,
callbacks=[es],
verbose=0
)
y_pred_mlp = mlp.predict(X_test).flatten()
y_pred_lstm = lstm.predict(X_test_lstm).flatten()
y_pred_cnn = cnn.predict(X_test_gaf).flatten()
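As a quick sanity check on the GAF transform defined earlier, the following sketch (restating the same function so it is self-contained) verifies two properties of the resulting images: they are symmetric, and their diagonal equals $\cos(2\phi_i) = 2x_i^2 - 1$ for the rescaled series.

```python
import numpy as np

def gramian_angular_field(ts):
    # Same construction as in the notebook: min-max scale, arccos, cos of sums.
    ts = (ts - ts.min()) / (ts.max() - ts.min())
    ts = np.clip(ts, 0, 1)
    phi = np.arccos(ts)
    return np.cos(phi[:, None] + phi[None, :])

window = np.array([0.0, 0.5, 1.0, 0.25])  # toy 4-step window, already in [0, 1]
img = gramian_angular_field(window)

print(img.shape)                  # (4, 4): one pixel per pair of time steps
print(np.allclose(img, img.T))    # True: GASF images are symmetric
# Diagonal entries are cos(2*phi_i) = 2*x_i^2 - 1 for the rescaled values:
print(np.allclose(np.diag(img), 2 * window**2 - 1))
```

Note that the per-window min-max rescaling discards each window's absolute scale; only the relative shape of the 20-lag window reaches the CNN.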
def report(name, y_true, y_pred):
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print(f"{name:<8} | RMSE: {rmse:.6f} | R²: {r2:.4f}")
print("\nPART (c): Prediction Performance (Leaky Setup)")
print("-" * 50)
report("MLP", y_test, y_pred_mlp)
report("LSTM", y_test, y_pred_lstm)
report("CNN-GAF", y_test, y_pred_cnn)
PART (c): Prediction Performance (Leaky Setup) -------------------------------------------------- MLP | RMSE: 0.049557 | R²: -23.9211 LSTM | RMSE: 0.006783 | R²: 0.5331 CNN-GAF | RMSE: 0.008722 | R²: 0.2281
results = [
{
"Model": "MLP",
"y_true": y_test,
"y_pred": y_pred_mlp,
"hist": history_mlp.history,
"r2": r2_score(y_test, y_pred_mlp)
},
{
"Model": "LSTM",
"y_true": y_test,
"y_pred": y_pred_lstm,
"hist": history_lstm.history,
"r2": r2_score(y_test, y_pred_lstm)
},
{
"Model": "CNN-GAF",
"y_true": y_test,
"y_pred": y_pred_cnn,
"hist": history_cnn.history,
"r2": r2_score(y_test, y_pred_cnn)
}
]
for res in results:
fig, axes = plt.subplots(1, 3, figsize=(24, 6))
# --------------------------------------------------
# Plot 1: Observed vs Predicted (Time Series)
# --------------------------------------------------
ax1 = axes[0]
zoom = 200
y_true = res["y_true"][-zoom:]
y_pred = res["y_pred"][-zoom:]
idx = np.arange(len(y_true))
ax1.plot(
idx, y_true,
color="black", linewidth=2.0,
label="Observed"
)
ax1.plot(
idx, y_pred,
color="crimson", linestyle="--", linewidth=2.0,
label="Predicted"
)
ax1.set_title(
f"{res['Model']} — Observed vs Predicted Forward Returns",
fontsize=14, fontweight="bold"
)
ax1.set_xlabel("Test Time Index")
ax1.set_ylabel("5-Day Forward Return")
ax1.legend()
ax1.grid(True, alpha=0.3)
# --------------------------------------------------
# Plot 2: Training & Validation Loss (Convergence)
# --------------------------------------------------
ax2 = axes[1]
hist = res["hist"]
ax2.plot(
hist["loss"],
color="steelblue", linewidth=2.0,
label="Training Loss"
)
ax2.plot(
hist["val_loss"],
color="darkorange", linewidth=2.0,
linestyle=":",
label="Validation Loss"
)
ax2.set_title(
"Training Convergence (MSE Loss)",
fontsize=14, fontweight="bold"
)
ax2.set_xlabel("Epochs")
ax2.set_ylabel("Loss")
ax2.legend()
ax2.grid(True, linestyle="--", alpha=0.4)
# --------------------------------------------------
# Plot 3: Actual vs Predicted (Scatter)
# --------------------------------------------------
ax3 = axes[2]
ax3.scatter(
res["y_true"],
res["y_pred"],
alpha=0.5,
color="cyan",
edgecolors="black"
)
min_val = min(res["y_true"].min(), res["y_pred"].min())
max_val = max(res["y_true"].max(), res["y_pred"].max())
ax3.plot(
[min_val, max_val],
[min_val, max_val],
"k--", lw=2,
label="Perfect Fit"
)
ax3.set_title(
f"Prediction Accuracy (R² = {res['r2']:.4f})",
fontsize=14, fontweight="bold"
)
ax3.set_xlabel("Observed Returns")
ax3.set_ylabel("Predicted Returns")
ax3.legend()
ax3.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
The LSTM achieved an $R^2$ of 0.5331, a value far too high to be plausible for predicting daily returns. The time series plot shows the predicted line tracking the observed forward returns almost perfectly, and in the scatter plot the points sit tightly along the diagonal, consistent with this inflated accuracy. The CNN-GAF also performed well, with an $R^2$ of 0.2281. The MLP, in contrast, failed to converge: its training loss dropped immediately while the validation loss remained stuck at a high level, indicating that it never learned a generalizable pattern.
Part (d)¶
#Part D
def backtest_strategy(y_true, y_pred, model_name):
"""
Long–Short and Long-Only backtest for regression-based signals
"""
df = pd.DataFrame({
"Ret": y_true,
"Pred": y_pred
}).dropna()
# --------------------------------------------------
# Trading Signals (Regression-Based)
# --------------------------------------------------
df["Position"] = np.where(df["Pred"] > 0, 1, -1)
# Shift positions to avoid look-ahead bias
df["Position_shift"] = df["Position"].shift(1)
# Long–Short strategy
df["Strat_ret"] = df["Position_shift"] * df["Ret"]
# Long-Only strategy
df["Position_L"] = df["Position_shift"].clip(lower=0)
df["Strat_ret_L"] = df["Position_L"] * df["Ret"]
# --------------------------------------------------
# Cumulative Returns
# --------------------------------------------------
df["Cum_LS"] = (1 + df["Strat_ret"]).cumprod() - 1
df["Cum_L"] = (1 + df["Strat_ret_L"]).cumprod() - 1
df["Cum_BH"] = (1 + df["Ret"]).cumprod() - 1
# --------------------------------------------------
# Performance Metrics
# --------------------------------------------------
results = {
"Model": model_name,
"Final_Return_LS": (1 + df["Strat_ret"]).prod() - 1,
"Final_Return_L": (1 + df["Strat_ret_L"]).prod() - 1,
"Final_Return_BH": (1 + df["Ret"]).prod() - 1,
"Sharpe_LS": np.mean(df["Strat_ret"]) / np.std(df["Strat_ret"]) * np.sqrt(252),
"Sharpe_L": np.mean(df["Strat_ret_L"]) / np.std(df["Strat_ret_L"]) * np.sqrt(252),
"Data": df
}
return results
bt_mlp = backtest_strategy(y_test, y_pred_mlp, "MLP")
bt_lstm = backtest_strategy(y_test, y_pred_lstm, "LSTM")
bt_cnn = backtest_strategy(y_test, y_pred_cnn, "CNN-GAF")
backtests = [bt_mlp, bt_lstm, bt_cnn]
summary_df = pd.DataFrame([
{
"Model": bt["Model"],
"Long–Short Return (%)": bt["Final_Return_LS"] * 100,
"Long-Only Return (%)": bt["Final_Return_L"] * 100,
"Buy & Hold Return (%)": bt["Final_Return_BH"] * 100,
"Sharpe (LS)": bt["Sharpe_LS"],
"Sharpe (L)": bt["Sharpe_L"]
}
for bt in backtests
])
print("\nPART (d): BACKTEST PERFORMANCE SUMMARY")
print("-" * 60)
print(summary_df.round(3).to_string(index=False))
print("-" * 60)
plt.figure(figsize=(14, 6))
for bt in backtests:
plt.plot(bt["Data"]["Cum_LS"], label=f"{bt['Model']} (Long–Short)")
plt.plot(backtests[0]["Data"]["Cum_BH"],
color="black", linestyle="--", linewidth=2,
label="Buy & Hold")
plt.title("Figure 5: Equity Curves — Long–Short Strategies",
fontsize=14, fontweight="bold")
plt.ylabel("Cumulative Return")
plt.xlabel("Test Time Index")
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
plt.figure(figsize=(14, 6))
for bt in backtests:
plt.plot(bt["Data"]["Cum_L"], label=f"{bt['Model']} (Long-Only)")
plt.plot(backtests[0]["Data"]["Cum_BH"],
color="black", linestyle="--", linewidth=2,
label="Buy & Hold")
plt.title("Figure 6: Equity Curves — Long-Only Strategies",
fontsize=14, fontweight="bold")
plt.ylabel("Cumulative Return")
plt.xlabel("Test Time Index")
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
PART (d): BACKTEST PERFORMANCE SUMMARY
------------------------------------------------------------
Model Long–Short Return (%) Long-Only Return (%) Buy & Hold Return (%) Sharpe (LS) Sharpe (L)
MLP 53.445 50.582 42.923 1.705 2.328
LSTM 581.519 216.856 42.923 8.255 6.418
CNN-GAF 524.726 202.990 42.923 7.793 5.790
------------------------------------------------------------
Summary for Part (c) and (d)¶
In this deliberately leaky setup, the models differ markedly in how well they exploit the temporal structure. The MLP performs poorly, with a strongly negative $R^2$, indicating that it learns no meaningful temporal structure in the forward-return series and underperforms even a naive mean forecast. In contrast, the LSTM shows the strongest predictive ability ($R^2 \approx 0.53$), closely tracking the observed forward-return series. The CNN-GAF performs reasonably well ($R^2 \approx 0.23$), capturing some structural patterns in the data, though it is less effective than the LSTM.
These predictive differences carry over directly to the trading results. In the backtests, the strategies based on the LSTM predictions clearly outperform the CNN-GAF and MLP strategies as well as a buy-and-hold position in Bitcoin. The LSTM long-short strategy delivers the highest return and Sharpe ratio, followed by the CNN-GAF strategy, while the MLP strategy yields only a modest profit. The long-only results show the same ordering.
Overall, these results show how models that represent temporal information well can produce highly optimistic forecasting and trading results when information leakage is present. This confirms the purpose of the exercise: improper label construction and data handling lead to overestimated model ability, both in fitting and in the corresponding backtest.
Step 2¶
To address the weaknesses of the single static split, we implemented a sliding-window approach. Unlike an "anchored" or expanding window, which retains all history, our method uses a fixed-size training window that rolls forward in time. This ensures that the model is always trained on the most recent market dynamics while discarding outdated information.
The validation process was executed in two specific configurations to test signal degradation:
Long Horizon: We used a training window of $N_{train}=500$ observations followed by a testing window of $N_{test}=500$. This tests the model's ability to hold a predictive edge over a longer period without retraining.
Short Horizon: We kept the $N_{train}=500$ training window but shortened the testing window to $N_{test}=100$. This simulates a more active strategy where the model is retrained frequently to adapt to new price levels.
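The number of folds each configuration produces follows directly from the loop arithmetic used later in the walk-forward function. In the sketch below, n = 1409 is taken from the Step 1 dataset summary; the exact count after the Step 2 dropna chain may differ slightly.

```python
# Mirror of the walk-forward loop's index arithmetic:
# range(0, n - train_len - test_len + 1, test_len)
def fold_starts(n, train_len, test_len):
    return list(range(0, n - train_len - test_len + 1, test_len))

n = 1409  # observations after dropna in Step 1 (assumption for Step 2)
print(len(fold_starts(n, 500, 500)))  # 500/500: a single fold
print(len(fold_starts(n, 500, 100)))  # 500/100: nine folds, retrain every 100 obs
```

Note the asymmetry this creates: the 500/500 configuration is evaluated on only one out-of-sample block, while the 500/100 configuration stitches together nine test blocks covering far more of the sample.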
We retained the data leakage introduced in Step 1. The matrix $X$ passed to the walk-forward function was pre-processed using StandardScaler fitted on the entire dataset ($t_0$ to $t_N$).
We standardized the inputs using the formula $z_t = \frac{x_t - \mu_{global}}{\sigma_{global}}$. This method creates a direct path for leakage. Because the global mean ($\mu_{global}$) and standard deviation ($\sigma_{global}$) include data from the test set, the model implicitly learns the future volatility profile before making a single prediction.
# STEP 2:
def build_lstm(timesteps):
model = Sequential([
LSTM(64, input_shape=(timesteps, 1)),
Dense(32, activation='relu'),
Dense(1)
])
model.compile(optimizer=Adam(0.001), loss='mse')
return model
def build_cnn_gaf(input_shape):
model = Sequential([
Conv2D(32, (3,3), activation='relu', input_shape=input_shape),
MaxPooling2D((2,2)),
Conv2D(64, (3,3), activation='relu'),
Flatten(),
Dense(64, activation='relu'),
Dense(1)
])
model.compile(optimizer=Adam(0.001), loss='mse')
return model
def gramian_angular_field(ts):
ts = (ts - ts.min()) / (ts.max() - ts.min())
ts = np.clip(ts, 0, 1)
phi = np.arccos(ts)
return np.cos(phi[:, None] + phi[None, :])
def build_gaf_images(X):
images = [gramian_angular_field(row) for row in X]
return np.array(images)[..., np.newaxis]
# Walk-forward backtest function
def walk_forward_backtest(X, y, train_len, test_len, model_type='lstm'):
"""
Non-anchored walk-forward backtest.
X, y: full dataset (scaled features and target)
train_len: length of training window
test_len: length of testing window
model_type: 'lstm' or 'cnn_gaf'
"""
n = len(X)
preds = []
actuals = []
for start in range(0, n - train_len - test_len + 1, test_len):
train_end = start + train_len
test_end = train_end + test_len
X_train = X[start:train_end]
y_train = y[start:train_end]
X_test = X[train_end:test_end]
y_test = y[train_end:test_end]
if model_type == 'lstm':
X_train_rs = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test_rs = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)
model = build_lstm(X_train_rs.shape[1])
model.fit(X_train_rs, y_train, epochs=30, batch_size=32, verbose=0)
y_pred = model.predict(X_test_rs, verbose=0).flatten()
elif model_type == 'cnn_gaf':
X_train_rs = build_gaf_images(X_train)
X_test_rs = build_gaf_images(X_test)
model = build_cnn_gaf(X_train_rs.shape[1:])
model.fit(X_train_rs, y_train, epochs=30, batch_size=32, verbose=0)
y_pred = model.predict(X_test_rs, verbose=0).flatten()
preds.extend(y_pred)
actuals.extend(y_test)
return np.array(actuals), np.array(preds)
# Prepare data (same as Step 1)
N_LAGS = 20
HORIZON = 5
log_returns = np.log(prices / prices.shift(1)).dropna()
y = log_returns.shift(-HORIZON).rolling(HORIZON).mean().dropna()
X = pd.concat([log_returns.shift(i) for i in range(N_LAGS)], axis=1).dropna()
X.columns = [f'lag_{i}' for i in range(N_LAGS)]
df = pd.concat([X, y.rename('target')], axis=1).dropna()
X = df.drop(columns='target')
y = df['target'].values
# Scale features (WITH leakage, as per Step 1 setup)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 2a: Walk-forward with train=500, test=500
print("="*60)
print("STEP 2a: Walk-forward (train=500, test=500)")
print("="*60)
actuals_2a_lstm, preds_2a_lstm = walk_forward_backtest(X_scaled, y, train_len=500, test_len=500, model_type='lstm')
actuals_2a_cnn, preds_2a_cnn = walk_forward_backtest(X_scaled, y, train_len=500, test_len=500, model_type='cnn_gaf')
# Step 2b: Walk-forward with train=500, test=100
print("\n" + "="*60)
print("STEP 2b: Walk-forward (train=500, test=100)")
print("="*60)
actuals_2b_lstm, preds_2b_lstm = walk_forward_backtest(X_scaled, y, train_len=500, test_len=100, model_type='lstm')
actuals_2b_cnn, preds_2b_cnn = walk_forward_backtest(X_scaled, y, train_len=500, test_len=100, model_type='cnn_gaf')
# Performance metrics function
def evaluate_performance(actuals, preds, model_name, step_label):
rmse = np.sqrt(np.mean((actuals - preds)**2))
r2 = 1 - np.sum((actuals - preds)**2) / np.sum((actuals - np.mean(actuals))**2)
print(f"{model_name} ({step_label}): RMSE = {rmse:.6f}, R² = {r2:.4f}")
return rmse, r2
print("\nPerformance Summary:")
print("-"*50)
rmse_2a_lstm, r2_2a_lstm = evaluate_performance(actuals_2a_lstm, preds_2a_lstm, "LSTM", "2a")
rmse_2a_cnn, r2_2a_cnn = evaluate_performance(actuals_2a_cnn, preds_2a_cnn, "CNN-GAF", "2a")
rmse_2b_lstm, r2_2b_lstm = evaluate_performance(actuals_2b_lstm, preds_2b_lstm, "LSTM", "2b")
rmse_2b_cnn, r2_2b_cnn = evaluate_performance(actuals_2b_cnn, preds_2b_cnn, "CNN-GAF", "2b")
# Backtest strategy function (reuse from Step 1)
def backtest_strategy(y_true, y_pred, model_name):
df = pd.DataFrame({'Ret': y_true, 'Pred': y_pred})
df['Position'] = np.where(df['Pred'] > 0, 1, -1)
df['Position_shift'] = df['Position'].shift(1)
df['Strat_ret'] = df['Position_shift'] * df['Ret']
df['Cum_LS'] = (1 + df['Strat_ret']).cumprod() - 1
sharpe = np.mean(df['Strat_ret']) / np.std(df['Strat_ret']) * np.sqrt(252) if np.std(df['Strat_ret']) > 0 else 0
final_return = (1 + df['Strat_ret']).prod() - 1
return {
'Model': model_name,
'Final_Return': final_return,
'Sharpe': sharpe,
'Data': df
}
# Run backtests for Step 2 results
bt_2a_lstm = backtest_strategy(actuals_2a_lstm, preds_2a_lstm, "LSTM (2a)")
bt_2a_cnn = backtest_strategy(actuals_2a_cnn, preds_2a_cnn, "CNN-GAF (2a)")
bt_2b_lstm = backtest_strategy(actuals_2b_lstm, preds_2b_lstm, "LSTM (2b)")
bt_2b_cnn = backtest_strategy(actuals_2b_cnn, preds_2b_cnn, "CNN-GAF (2b)")
# Plot equity curves
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
for idx, (bt, title) in enumerate([(bt_2a_lstm, "LSTM 2a"), (bt_2a_cnn, "CNN-GAF 2a"),
(bt_2b_lstm, "LSTM 2b"), (bt_2b_cnn, "CNN-GAF 2b")]):
ax = axes[idx//2, idx%2]
ax.plot(bt['Data']['Cum_LS'], label='Long-Short')
ax.set_title(f"{title} - Equity Curve", fontweight='bold')
ax.set_ylabel("Cumulative Return")
ax.set_xlabel("Step")
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Summary table
summary = pd.DataFrame([
{'Model': 'LSTM', 'Step': '2a', 'RMSE': rmse_2a_lstm, 'R2': r2_2a_lstm, 'Sharpe': bt_2a_lstm['Sharpe'], 'Final_Return': bt_2a_lstm['Final_Return']},
{'Model': 'CNN-GAF', 'Step': '2a', 'RMSE': rmse_2a_cnn, 'R2': r2_2a_cnn, 'Sharpe': bt_2a_cnn['Sharpe'], 'Final_Return': bt_2a_cnn['Final_Return']},
{'Model': 'LSTM', 'Step': '2b', 'RMSE': rmse_2b_lstm, 'R2': r2_2b_lstm, 'Sharpe': bt_2b_lstm['Sharpe'], 'Final_Return': bt_2b_lstm['Final_Return']},
{'Model': 'CNN-GAF', 'Step': '2b', 'RMSE': rmse_2b_cnn, 'R2': r2_2b_cnn, 'Sharpe': bt_2b_cnn['Sharpe'], 'Final_Return': bt_2b_cnn['Final_Return']}
])
print("\n" + "="*60)
print("STEP 2 - PERFORMANCE SUMMARY TABLE")
print("="*60)
print(summary.round(4).to_string(index=False))
============================================================ STEP 2a: Walk-forward (train=500, test=500) ============================================================
============================================================ STEP 2b: Walk-forward (train=500, test=100) ============================================================
Performance Summary: -------------------------------------------------- LSTM (2a): RMSE = 0.012118, R² = -0.2133 CNN-GAF (2a): RMSE = 0.014764, R² = -0.8010 LSTM (2b): RMSE = 0.012087, R² = -0.3099 CNN-GAF (2b): RMSE = 0.012618, R² = -0.4274
============================================================ STEP 2 - PERFORMANCE SUMMARY TABLE ============================================================ Model Step RMSE R2 Sharpe Final_Return LSTM 2a 0.0121 -0.2133 -0.7227 -0.2478 CNN-GAF 2a 0.0148 -0.8010 -0.2058 -0.0983 LSTM 2b 0.0121 -0.3099 -0.3868 -0.2473 CNN-GAF 2b 0.0126 -0.4274 0.1422 0.0354
As shown in the Performance Summary Table, the LSTM (2a) dropped from a positive $R^2$ of ~0.53 in Step 1 to -0.2133, and the CNN-GAF (2a) fared even worse at -0.8010. This confirms that the initial success was almost entirely driven by the model's ability to "memorize" the global static split. When forced to adapt to new data in 500-day increments, the models failed to beat a simple horizontal line (mean prediction).
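As a reminder of the baseline implied by $R^2$: a constant forecast equal to the sample mean scores exactly zero, so the negative values in the table mean the models underperform even that naive benchmark (toy numbers for illustration).

```python
import numpy as np
from sklearn.metrics import r2_score

# A constant prediction at the sample mean makes the residual sum of squares
# equal to the total sum of squares, so R² = 1 - SS_res/SS_tot = 0 exactly.
y_true = np.array([0.01, -0.02, 0.015, -0.005, 0.02])
mean_forecast = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, mean_forecast))  # 0.0
```

Any negative $R^2$ therefore signals forecasts that are actively worse than ignoring the features altogether.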
In the equity-curve panels, the LSTM 2a curve (top left) shows a steady downward drift, losing nearly 25% of its value by the end of the backtest. This stands in stark contrast to the near-perfect upward equity curves seen in Step 1.
The CNN-GAF model actually improved slightly when we shortened the test window to 100 days (Step 2b). While it still lost money in the 500/500 split (Return: -9.83%), it managed a small positive return of 3.54% and a positive Sharpe of 0.14 in the 500/100 split (Step 2b).
This suggests that the image-based GAF approach may capture very short-term momentum signals that decay quickly: traded over a 500-day window (2a) they fail, while retraining every 100 days (2b) recovers a fraction of their value.
Step 3¶
To address the issue of look-ahead bias identified in the previous stages, we restructured the data normalization pipeline for Step 3. In the earlier iterations, the application of a global scaler across the entire dataset allowed the models to inadvertently access future statistical properties, specifically the mean and variance of the test set before the prediction phase.
This violation of causality is a primary driver of inflated backtest performance and must be eliminated to evaluate the true utility of the neural networks. Consequently, we implemented a strict dynamic normalization embedded directly within the Walk-Forward Validation loop.
The core of this solution involves initializing a completely new instance of the standard scaler for every single iteration of the rolling window. By doing so, we ensure that the standardization parameters are derived exclusively from the 500 observations available in the training set at that specific point in time.
The normalization of the subsequent test set is then performed using these frozen training statistics, following the equation $$z_{test} = \frac{x_{test} - \mu_{train}}{\sigma_{train}}.$$
This approach simulates a realistic live-trading environment where the model must contend with outliers or volatility shifts in the test data without prior knowledge of the future distribution. We applied this sanitized workflow to both the 500/500 and 500/100 split scenarios to determine if the models could retain any predictive edge once the artificial advantage of data leakage was stripped away.
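The difference between the leaky global scaler and the frozen per-fold scaler can be demonstrated directly on synthetic data with a volatility regime shift (a sketch with made-up numbers, not the assignment's pipeline):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Low-volatility "training" regime followed by a high-volatility "test" regime
train = rng.normal(0, 1.0, size=(500, 1))
test = rng.normal(0, 3.0, size=(100, 1))

# Leaky: a scaler fit on all data absorbs the future volatility shift
leaky = StandardScaler().fit(np.vstack([train, test]))
# Causal: a scaler frozen on training statistics only, as in Step 3
causal = StandardScaler().fit(train)

z_leaky = leaky.transform(test)
z_causal = causal.transform(test)
# The causal transform leaves the test regime visibly "hot" (std well above 1),
# which is exactly the surprise the leaky setup hides from the model.
```

Under the leaky scaler the test set looks deceptively tame, because its own variance was baked into the normalization, which is the mechanism behind the inflated Step 1 results.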
# Step 3
log_returns = np.log(prices / prices.shift(1)).dropna()
y_raw = log_returns.shift(-HORIZON).rolling(HORIZON).mean().dropna()
X_raw = pd.concat([log_returns.shift(i) for i in range(N_LAGS)], axis=1).dropna()
X_raw.columns = [f'lag_{i}' for i in range(N_LAGS)]
df_clean = pd.concat([X_raw, y_raw.rename('target')], axis=1).dropna()
X_clean = df_clean.drop(columns='target').values
y_clean = df_clean['target'].values
print(f"Data reset for Step 3. Shape: {X_clean.shape}")
# Modified walk-forward function that also returns the last fold's training history
def walk_forward_no_leakage_with_history(X, y, train_len, test_len, model_type='lstm'):
    """
    Performs a non-anchored walk-forward backtest with per-fold scaling.
    Returns: actuals, predictions, and the history of the LAST training fold.
    """
    n = len(X)
    preds = []
    actuals = []
    last_history = None
    for start in range(0, n - train_len - test_len + 1, test_len):
        train_end = start + train_len
        test_end = train_end + test_len
        # 1. Slice raw data for this fold
        X_train_raw = X[start:train_end]
        y_train = y[start:train_end]
        X_test_raw = X[train_end:test_end]
        y_test = y[train_end:test_end]
        # 2. Dynamic normalization (no leakage): fit on the training fold only
        local_scaler = StandardScaler()
        X_train_scaled = local_scaler.fit_transform(X_train_raw)
        X_test_scaled = local_scaler.transform(X_test_raw)
        # 3. Build model inputs; rebuild the model each fold for a clean slate
        if model_type == 'lstm':
            X_train_rs = X_train_scaled.reshape(X_train_scaled.shape[0], X_train_scaled.shape[1], 1)
            X_test_rs = X_test_scaled.reshape(X_test_scaled.shape[0], X_test_scaled.shape[1], 1)
            model = build_lstm(X_train_rs.shape[1])
        elif model_type == 'cnn_gaf':
            X_train_rs = build_gaf_images(X_train_scaled)
            X_test_rs = build_gaf_images(X_test_scaled)
            model = build_cnn_gaf(X_train_rs.shape[1:])
        else:
            raise ValueError(f"Unknown model_type: {model_type}")
        # 4. Train, capturing the loss history for later convergence plots
        history = model.fit(
            X_train_rs, y_train,
            validation_split=0.2,
            epochs=30,
            batch_size=32,
            verbose=0
        )
        y_pred = model.predict(X_test_rs, verbose=0).flatten()
        preds.extend(y_pred)
        actuals.extend(y_test)
        last_history = history  # Keep the last fold's history for plotting
    return np.array(actuals), np.array(preds), last_history
# Run Backtests (Step 3b and 3c)
print("Running Step 3b (500/500)...")
y_3b_lstm, p_3b_lstm, h_3b_lstm = walk_forward_no_leakage_with_history(X_clean, y_clean, 500, 500, 'lstm')
y_3b_cnn, p_3b_cnn, h_3b_cnn = walk_forward_no_leakage_with_history(X_clean, y_clean, 500, 500, 'cnn_gaf')
print("Running Step 3c (500/100)...")
y_3c_lstm, p_3c_lstm, h_3c_lstm = walk_forward_no_leakage_with_history(X_clean, y_clean, 500, 100, 'lstm')
y_3c_cnn, p_3c_cnn, h_3c_cnn = walk_forward_no_leakage_with_history(X_clean, y_clean, 500, 100, 'cnn_gaf')
# Collect each model's predictions, last-fold training history, and R² for plotting
results_step3_plots = [
    {
        "Model": "LSTM (Step 3b: 500/500)",
        "y_true": y_3b_lstm,
        "y_pred": p_3b_lstm,
        "hist": h_3b_lstm.history,
        "r2": r2_score(y_3b_lstm, p_3b_lstm)
    },
    {
        "Model": "CNN-GAF (Step 3b: 500/500)",
        "y_true": y_3b_cnn,
        "y_pred": p_3b_cnn,
        "hist": h_3b_cnn.history,
        "r2": r2_score(y_3b_cnn, p_3b_cnn)
    },
    {
        "Model": "LSTM (Step 3c: 500/100)",
        "y_true": y_3c_lstm,
        "y_pred": p_3c_lstm,
        "hist": h_3c_lstm.history,
        "r2": r2_score(y_3c_lstm, p_3c_lstm)
    },
    {
        "Model": "CNN-GAF (Step 3c: 500/100)",
        "y_true": y_3c_cnn,
        "y_pred": p_3c_cnn,
        "hist": h_3c_cnn.history,
        "r2": r2_score(y_3c_cnn, p_3c_cnn)
    }
]
print("\nGenerating Loss and Performance Plots for Step 3...")
for res in results_step3_plots:
    fig, axes = plt.subplots(1, 3, figsize=(24, 6))
    # Panel 1: observed vs. predicted returns over the last 150 test days
    ax1 = axes[0]
    zoom = 150
    y_true = res["y_true"][-zoom:]
    y_pred = res["y_pred"][-zoom:]
    idx = np.arange(len(y_true))
    ax1.plot(idx, y_true, color="black", linewidth=2.0, label="Observed")
    ax1.plot(idx, y_pred, color="crimson", linestyle="--", linewidth=2.0, label="Predicted")
    ax1.set_title(f"{res['Model']} — Last {zoom} Days", fontsize=14, fontweight="bold")
    ax1.set_xlabel("Test Time Index")
    ax1.set_ylabel("5-Day Forward Return")
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    # Panel 2: training vs. validation loss for the last training fold
    ax2 = axes[1]
    hist = res["hist"]
    ax2.plot(hist["loss"], color="steelblue", linewidth=2.0, label="Training Loss")
    if "val_loss" in hist:
        ax2.plot(hist["val_loss"], color="darkorange", linewidth=2.0, linestyle=":", label="Validation Loss")
    ax2.set_title("Training Convergence (Last Fold)", fontsize=14, fontweight="bold")
    ax2.set_xlabel("Epochs")
    ax2.set_ylabel("MSE Loss")
    ax2.legend()
    ax2.grid(True, linestyle="--", alpha=0.4)
    # Panel 3: predicted vs. observed scatter against the 45-degree line
    ax3 = axes[2]
    ax3.scatter(res["y_true"], res["y_pred"], alpha=0.5, color="cyan", edgecolors="black")
    min_val = min(res["y_true"].min(), res["y_pred"].min())
    max_val = max(res["y_true"].max(), res["y_pred"].max())
    ax3.plot([min_val, max_val], [min_val, max_val], "k--", lw=2, label="Perfect Fit")
    ax3.set_title(f"Prediction Accuracy (R² = {res['r2']:.4f})", fontsize=14, fontweight="bold")
    ax3.set_xlabel("Observed Returns")
    ax3.set_ylabel("Predicted Returns")
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

# Summary metrics: RMSE, R², and the annualized Sharpe of a simple sign strategy
metrics_summary = []
for res in results_step3_plots:
    df_temp = pd.DataFrame({'ret': res['y_true'], 'pred': res['y_pred']})
    df_temp['pos'] = np.where(df_temp['pred'] > 0, 1, -1)
    df_temp['strat'] = df_temp['pos'].shift(1) * df_temp['ret']
    sharpe = (df_temp['strat'].mean() / df_temp['strat'].std() * np.sqrt(252)) if df_temp['strat'].std() > 0 else 0
    metrics_summary.append({
        "Model": res['Model'],
        "RMSE": np.sqrt(mean_squared_error(res['y_true'], res['y_pred'])),
        "R2": res['r2'],
        "Sharpe": sharpe
    })
print("\nStep 3 Final Metrics Summary:")
print(pd.DataFrame(metrics_summary).round(4).to_string(index=False))
Data reset for Step 3. Shape: (1405, 20)
Running Step 3b (500/500)...
Running Step 3c (500/100)...
Generating Loss and Performance Plots for Step 3...
Step 3 Final Metrics Summary:
Model RMSE R2 Sharpe
LSTM (Step 3b: 500/500) 0.0141 -0.6367 -3.4770
CNN-GAF (Step 3b: 500/500) 0.0143 -0.6817 0.6595
LSTM (Step 3c: 500/100) 0.0125 -0.4024 -0.9971
CNN-GAF (Step 3c: 500/100) 0.0138 -0.7189 -0.2966
Step 3d¶
Step 3 finally settles the overfitting debate. Once we switched to the dynamic normalization method—stripping away the global scaler—the performance didn't just drop; it collapsed.
The metrics summary paints a bleak picture. $R^2$ values across the board flipped to negative, landing between -0.40 and -0.71. In financial modeling, a negative $R^2$ is worse than useless—it means the model performed worse than a simple horizontal line predicting the average. This confirms that the high "alpha" we saw in Step 1 wasn't genuine. It was purely a function of leakage. The models were using future variance data to calibrate their predictions.
Visually, the failure is obvious. The scatter plots (like the LSTM in Step 3b) no longer track the diagonal; they show a shapeless cloud of predictions with zero correlation to reality. The convergence plots back this up: the Training Loss (blue line) tanks to near-zero, but the Validation Loss (orange dotted line) stays high and erratic.
There is one odd outlier: the CNN-GAF (Step 3b) posted a positive Sharpe Ratio (0.6595) despite a terrible $R^2$ of -0.6817. Don't be fooled by this. In a market as volatile as Bitcoin, a bad model can get "lucky" on the direction of a few massive candles, creating a positive return even if its overall accuracy is garbage. This is a statistical artifact, not a valid signal.
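This "lucky Sharpe" effect is easy to reproduce with synthetic data: predictions that are pure noise, except that they happen to lean long on a handful of large up-days, can post a positive strategy return while their $R^2$ stays deeply negative. A toy illustration (made-up numbers, not the BTC backtest, and using a simplified sign strategy without the execution lag applied in the notebook):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# Mostly-noise daily returns, plus ten large up-moves ("massive candles")
ret = rng.normal(0, 0.01, 250)
ret[::25] = 0.08
# Predictions are pure noise, except they happen to go long on the big days
pred = rng.normal(0, 0.02, 250)
pred[::25] = 0.01

strat = np.sign(pred) * ret                     # simplified sign strategy
sharpe = strat.mean() / strat.std() * np.sqrt(252)
r2 = r2_score(ret, pred)
# r2 is deeply negative, yet sharpe comes out positive: the few lucky
# candles dominate the P&L while day-to-day accuracy is worthless.
```

A Sharpe driven by a handful of outlier days like this tells you nothing about repeatable edge, which is why the $R^2$ and the equity curve have to be read together.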
Conclusion: The "apparent" overfitting from Step 2 is gone, but only because we removed the cheat code. It has been replaced by the harsh reality that, without look-ahead bias, these architectures struggle to find any signal in the daily Bitcoin random walk.
Conclusion¶
This study investigated the susceptibility of Deep Learning models specifically MLP, LSTM, and CNN-GAF architectures to look-ahead bias within financial time series forecasting. The primary objective was to quantify the performance discrepancy between a flawed validation environment (containing data leakage) and a rigorous, non-anchored walk-forward framework.
The results from Steps 1 and 2 provide a compelling demonstration of the risks associated with improper data normalization. In the initial phase, where global statistics were permitted to influence the training sets, the LSTM and CNN-GAF models exhibited statistically improbable predictive power, achieving Sharpe ratios exceeding 7.0 and $R^2$ values significantly above zero. These metrics, while attractive, were identified as methodological artifacts rather than genuine market signals. The models effectively memorized the future volatility regime inherent in the global scaler, allowing them to anticipate magnitude shifts in the Bitcoin price series.
This flaw was corrected in Step 3. Once dynamic normalization was implemented, with standardization parameters isolated strictly within each training window, the predictive performance collapsed. The shift to negative $R^2$ values across all models confirms that the alpha observed in the earlier stages was exclusively a product of information leakage.
Consequently, these results highlight that standard Deep Learning architectures, when applied to raw daily returns without robust feature engineering, struggle to overcome the non-stationarity of cryptocurrency markets. The findings underscore the critical necessity of strictly causal validation pipelines; without them, sophisticated neural networks are prone to fitting the validation noise rather than the underlying market signal.