Introduction¶
In time series analysis, beyond simply studying market dynamics, economists are primarily interested in forecasting financial behavior for the days ahead. To achieve this, a variety of models have historically been employed. These range from standard Autoregressive (AR), Moving Average (MA), and ARMA/ARIMA models, to those that model variability alongside the mean, such as GARCH (for normal distributions) or GARMA and GLARMA (for distributions in the exponential family). However, these traditional models impose rigid structures and often fail to accurately capture the complex patterns found in real-world time series.
Consequently, models based on neural networks have gained traction due to their higher predictive power. These operate on the assumption that any unknown function can be approximated by a neural network composed of inputs, hidden layers, activation functions, weights, and error terms, similar in structure to a basic Generalized Linear Model (Leshno et al., 1993; Funahashi, 1989). Beyond standard neural networks, there are Convolutional Neural Networks (CNNs). While typically famous for image analysis, Oates and Wang (2) suggest that we can leverage the power of CNNs for finance by transforming time series data into images using Gramian Angular Fields (GAF) to generate forecasts.
Regardless of the model chosen, accurate forecasting generally requires the data’s basic properties such as mean, variance, and autocorrelation to remain stable over time (Prado 78). This property, known as stationarity, is crucial for identifying the right model. Traditionally, stationarity is achieved through integer differentiation (like taking log returns). However, as Prado (78) argues, this brute-force approach often wipes out the series' memory, leaving behind nothing but noise. This necessitates methods like fractional differentiation, which achieves stationarity while still preserving the underlying patterns and signal of the original time series.
This project explores these concepts, demonstrating their implementation in Python, interpreting the results, and discussing the trade-offs of modeling time series both with and without differentiation across different Deep Learning architectures.
Step 1¶
Let's consider a financial asset price $Y(t)$, where $t$ can be discrete or continuous time. Financial price series are stochastic processes that exhibit behaviors like Brownian motion or a random walk, making them non-stationary.
In a non-stationary process, statistical properties such as the mean ($\mu$), variance ($\sigma^2$), and autocovariance are not constant but change over time (Prado 75). For reliable statistical inference and forecasting, however, it is best to work with stationary processes: stationarity ensures that our estimators are consistent and that standard asymptotic convergence theorems hold, making predictions valid. To assess and achieve stationarity of the time series $Y(t)$, we work with three tools:
Augmented Dickey Fuller (ADF) test¶
This test evaluates the null hypothesis ($H_0$) that the price $Y(t)$ has a unit root, i.e., that the series is non-stationary. It is based on regressing the first difference ($\Delta Y_t = Y_t - Y_{t-1}$) of the time series $(Y_t = Y(t))$ on a drift ($\beta_0$), a deterministic time trend ($\beta t$), the force of mean reversion acting on the lagged level ($\gamma Y_{t-1}$), lagged differences ($\delta_1 \Delta Y_{t-1} + \dots + \delta_{p-1} \Delta Y_{t-p+1}$), and an error term ($\varepsilon_t \sim N(0, \sigma^2)$), $$\Delta Y_t = \beta_0 + \beta t + \gamma Y_{t-1} + \delta_1 \Delta Y_{t-1} + \dots + \delta_{p-1} \Delta Y_{t-p+1} + \varepsilon_t .$$ If the p-value $<\alpha$, where $\alpha$ is the significance level chosen by the researcher, we reject $H_0$ (Dickey and Fuller 431).
Integer Differencing (log Returns)¶
As described by Prado (75), this transformation works by taking the first difference of log-prices, $$r_t = \ln(Y_t) - \ln(Y_{t-1}) = \ln\left(\frac{Y_t}{Y_{t-1}}\right).$$
Fractional Differencing¶
According to Prado (75), arbitrage forces cause financial time series to exhibit low signal-to-noise ratios. To make matters worse, standard stationarity transformations such as integer differencing reduce the signal further by removing memory; once these transformations wipe out the memory of the data, statisticians are forced to apply complex mathematical techniques to extract whatever residual signal remains.
However, applying such techniques to memory-erased series often leads to false discoveries (Prado 75). To address this issue, Prado (75) suggests using fractional differencing, introduced by Granger and Joyeux (1980) and Hosking (1981). Instead of applying integer differencing, this method differentiates the time series by a degree $d \in (0,1)$ and searches for the minimum value of $d$ required to pass the ADF test. Prado (75) argues that this preserves the maximum amount of memory/signal while still achieving stationarity.
To define the transformation, we use the backshift operator $B$, where $B^k Y_t = Y_{t-k}$. From time series analysis we know that:
$$ \begin{aligned} BY_t &= Y_{t-1}\\ (1 - B)Y_t &= Y_t - Y_{t-1}\\ (1 - B)^2 Y_t &= (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2})\\ \vdots\\ (1-B)^d Y_t &= \Delta^d Y_t.\\ (1-B)^d Y_t &=\sum_{k=0}^{\infty} {d \choose k}(-B)^k Y_t\\ &=\sum_{k=0}^{\infty}\frac{d(d-1)\cdots(d-k+1)}{k!}(-B)^k Y_t\\ \Delta^d Y_t&=\sum_{k=0}^\infty \frac{\prod_{i=0}^{k-1}(d-i)}{k!}(-1)^k B^k Y_t\\ \underbrace{\Delta^d Y_t}_{\widetilde{Y}_t} &=\sum_{k=0}^\infty \underbrace{\frac{\prod_{i=0}^{k-1}(d-i)}{k!}(-1)^k}_{w_k= -w_{k-1}\frac{d-k+1}{k}} Y_{t-k}\\ \widetilde{Y}_t&= \sum_{k=0}^\infty w_k Y_{t-k} \end{aligned} $$ Prado (80) states that to implement Fractional Differencing, we can use two alternatives: (1) an expanding window, or (2) a fixed-width window (fracdiff), since we do not have infinite data to apply infinitely long weights.
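The weight recursion $w_k = -w_{k-1}\frac{d-k+1}{k}$ can be verified in a few lines of Python; here $d = 0.5$ is an illustrative value, not the optimized one found later, and $d = 1$ is included to show that integer differencing is the special case where memory is discarded:

```python
def frac_weights(d, n):
    """First n+1 weights of (1 - B)^d via the recursion w_k = -w_{k-1}(d-k+1)/k."""
    w = [1.0]
    for k in range(1, n + 1):
        w.append(-w[-1] * (d - k + 1) / k)
    return w

w_half = frac_weights(0.5, 5)  # slow hyperbolic decay -> long memory preserved
w_one = frac_weights(1.0, 5)   # [1, -1, 0, 0, ...] -> ordinary first difference

print([round(x, 4) for x in w_half])  # [1.0, -0.5, -0.125, -0.0625, -0.0391, -0.0273]
print([round(x, 4) for x in w_one])   # [1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
```

The slowly decaying weights for $d = 0.5$ are exactly why distant past values keep contributing to $\widetilde{Y}_t$.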
In the code below, we implement these mathematical expressions to solve step 1.
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import scipy.stats as stats
np.random.seed(42)
#
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 5)
# ---------------------------------------------------------
# 1. DATA ACQUISITION (LEVELS)
# ---------------------------------------------------------
ticker = "BTC-USD"
start_date = "2022-01-01"
end_date = "2023-12-31"
print(f"Acquiring financial data for {ticker}...")
# Download data
data = yf.download(ticker, start=start_date, end=end_date, progress=False, auto_adjust=True)
# Handle MultiIndex columns if present (common with newer yfinance versions)
if isinstance(data.columns, pd.MultiIndex):
try:
prices = data.xs('Close', level=0, axis=1)[ticker]
except KeyError:
# Fallback if structure is different
prices = data.iloc[:, 0] # Take first column assuming it's close
else:
prices = data['Close']
prices = prices.dropna()
print(f"Data loaded successfully. Observations: {len(prices)}")
# 1. Visual Inspection
plt.figure(figsize=(14, 5))
plt.plot(prices, color='navy', linewidth=1.5)
plt.title(f"Figure 1: {ticker} Prices (Levels)", fontsize=14, fontweight='bold')
plt.ylabel("Price (USD)")
plt.xlabel("Date")
plt.tight_layout()
plt.show()
# 2. Autocorrelation Analysis (ACF & PACF)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
plot_acf(prices, lags=40, ax=ax1, title='Figure 1.1: ACF-Levels (Slow Decay indicates Non-Stationarity)')
plot_pacf(prices, lags=40, ax=ax2, title='Figure 1.2: PACF - Levels')
plt.tight_layout()
plt.show()
# 3. Statistical Summary & Stationarity Test
desc_stats = prices.describe()
skewness = stats.skew(prices)
kurtosis = stats.kurtosis(prices)
adf_result = adfuller(prices)
print("\n" + "-"*60)
print("Table 1.a). STATISTICAL CHARACTERIZATION: LEVELS")
print("-"*60)
print(f"Mean: {desc_stats['mean']:.4f}")
print(f"Std Dev: {desc_stats['std']:.4f}")
print(f"Skewness: {skewness:.4f}")
print(f"Kurtosis: {kurtosis:.4f}")
print("-" * 30)
print(f"ADF Statistic: {adf_result[0]:.4f}")
print(f"p-value: {adf_result[1]:.4f}")
print("Conclusion: " + ("STATIONARY" if adf_result[1] < 0.05 else "NON-STATIONARY (Fail to reject H0)"))
print("-"*60)
# ---------------------------------------------------------
# B. ANALYSIS OF TRANSFORMED SERIES (INTEGER DIFFERENCING)
# ---------------------------------------------------------
# 1. Transformation (Log Returns)
log_returns = np.log(prices / prices.shift(1)).dropna()
# 2. Visual Inspection
plt.figure(figsize=(14, 5))
plt.plot(log_returns, color='darkgreen', linewidth=1)
plt.title(f"Figure 2: Differenced Series: {ticker} Log Returns (d=1)", fontsize=14, fontweight='bold')
plt.ylabel("Log Return")
plt.tight_layout()
plt.show()
# 3. Autocorrelation Analysis
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
plot_acf(log_returns, lags=40, ax=ax1, title='Figure 2.1: ACF- Log Returns (Memory Loss)')
plot_pacf(log_returns, lags=40, ax=ax2, title='Figure 2.2: PACF - Log Returns')
plt.tight_layout()
plt.show()
# 4. Statistical Summary
desc_stats_log = log_returns.describe()
skewness_log = stats.skew(log_returns)
kurtosis_log = stats.kurtosis(log_returns)
adf_result_log = adfuller(log_returns)
print("\n" + "-"*60)
print("Table 2b.) STATISTICAL CHARACTERIZATION: LOG RETURNS")
print("-"*60)
print(f"Mean: {desc_stats_log['mean']:.4f}")
print(f"Std Dev: {desc_stats_log['std']:.4f}")
print(f"Skewness: {skewness_log:.4f}")
print(f"Kurtosis: {kurtosis_log:.4f}")
print("-" * 30)
print(f"ADF Statistic: {adf_result_log[0]:.4f}")
print(f"p-value: {adf_result_log[1]:.4e}") # Scientific notation for very small p
print("Conclusion: " + ("STATIONARY" if adf_result_log[1] < 0.05 else "NON-STATIONARY"))
print("-"*60)
# ---------------------------------------------------------
# C. ANALYSIS OF FRACTIONAL DIFFERENCING
# ---------------------------------------------------------
def get_weights_ffd(d, thres=1e-5):
"""Calculate weights for fractional differencing using binomial expansion."""
w, k = [1.], 1
while True:
w_k = -w[-1] / k * (d - k + 1)
if abs(w_k) < thres:
break
w.append(w_k)
k += 1
return np.array(w[::-1])
def frac_diff_ffd(series, d, thres=1e-5):
"""Apply fixed-window fractional differencing."""
# 1. Compute weights
w = get_weights_ffd(d, thres)
width = len(w) - 1
# 2. Apply weights (Vectorized rolling dot product)
# Note: We apply to the raw series values
vals = series.values
if len(vals) < width:
return pd.Series(dtype=float)
res = np.full(len(vals), np.nan)
# Iterative application (robust for variable window sizes)
# For high performance on large data, stride_tricks can be used,
# but loop is sufficient for <2000 obs.
for i in range(width, len(vals)):
res[i] = np.dot(vals[i-width:i+1], w)
return pd.Series(res, index=series.index)
# 1. Optimization: Find minimum d for stationarity
print("\nOptimizing Fractional Differentiation parameter (d)...")
possible_ds = np.linspace(0.1, 0.9, 17) # Check 0.1, 0.15 ... 0.9
optimal_d = 1.0
frac_diff_series = None
for d_val in possible_ds:
temp_series = frac_diff_ffd(prices, d_val).dropna()
# Run ADF check
if len(temp_series) > 50: # Ensure sufficient data
p_val = adfuller(temp_series)[1]
if p_val < 0.05:
optimal_d = d_val
frac_diff_series = temp_series
print(f" Found stationary d = {optimal_d:.2f} (p-value: {p_val:.4f})")
break
if frac_diff_series is None:
print(" No stationary d < 1.0 found. Using d=1.0 (Log Returns equivalent).")
frac_diff_series = log_returns
optimal_d = 1.0
# 2. Visual Inspection
plt.figure(figsize=(14, 5))
plt.plot(frac_diff_series, color='darkred', linewidth=1)
plt.title(f"Figure 3: Fractionally Differenced Series (d={optimal_d:.2f})", fontsize=14, fontweight='bold')
plt.ylabel("Transformed Value")
plt.tight_layout()
plt.show()
# 3. Autocorrelation Analysis
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
plot_acf(frac_diff_series, lags=40, ax=ax1, title=f'Figure 3.1: ACF- Frac Diff d={optimal_d:.2f} (Memory Preserved)')
plot_pacf(frac_diff_series, lags=40, ax=ax2, title='Figure 3.2: PACF - Frac Diff')
plt.tight_layout()
plt.show()
# 4. Statistical Summary
desc_stats_frac = frac_diff_series.describe()
skewness_frac = stats.skew(frac_diff_series)
kurtosis_frac = stats.kurtosis(frac_diff_series)
adf_result_frac = adfuller(frac_diff_series)
print("\n" + "-"*60)
print(f"Table 3c.) STATISTICAL CHARACTERIZATION: FRAC DIFF (d={optimal_d:.2f})")
print("-"*60)
print(f"Mean: {desc_stats_frac['mean']:.4f}")
print(f"Std Dev: {desc_stats_frac['std']:.4f}")
print(f"Skewness: {skewness_frac:.4f}")
print(f"Kurtosis: {kurtosis_frac:.4f}")
print("-" * 30)
print(f"ADF Statistic: {adf_result_frac[0]:.4f}")
print(f"p-value: {adf_result_frac[1]:.4f}")
print("Conclusion: " + ("STATIONARY" if adf_result_frac[1] < 0.05 else "NON-STATIONARY"))
print("-"*60)
Acquiring financial data for BTC-USD...
Data loaded successfully. Observations: 729
------------------------------------------------------------
Table 1.a). STATISTICAL CHARACTERIZATION: LEVELS
------------------------------------------------------------
Mean: 28509.7581
Std Dev: 8321.9629
Skewness: 0.4995
Kurtosis: -0.8028
------------------------------
ADF Statistic: -1.5208
p-value: 0.5232
Conclusion: NON-STATIONARY (Fail to reject H0)
------------------------------------------------------------
------------------------------------------------------------
Table 2b.) STATISTICAL CHARACTERIZATION: LOG RETURNS
------------------------------------------------------------
Mean: -0.0002
Std Dev: 0.0288
Skewness: -0.4091
Kurtosis: 5.5382
------------------------------
ADF Statistic: -26.7875
p-value: 0.0000e+00
Conclusion: STATIONARY
------------------------------------------------------------
Optimizing Fractional Differentiation parameter (d)...
  Found stationary d = 0.60 (p-value: 0.0174)
------------------------------------------------------------
Table 3c.) STATISTICAL CHARACTERIZATION: FRAC DIFF (d=0.60)
------------------------------------------------------------
Mean: 817.3111
Std Dev: 973.0975
Skewness: 0.6413
Kurtosis: 1.1414
------------------------------
ADF Statistic: -3.2466
p-value: 0.0174
Conclusion: STATIONARY
------------------------------------------------------------
d) Comment on the 3 representations of the data¶
When we look at the BTC-USD price series (Figure 1 and Table 1.a), it behaves much like a typical financial price: it drifts around, shows long trends, and doesn't settle around any fixed mean. The ACF confirms this: instead of dropping quickly, it fades out very slowly, which is a classic sign of a non-stationary random walk.
The PACF also shows a strong spike at lag 1, telling us that today’s price is basically yesterday’s price plus noise. The ADF test gives a very high p-value (0.5232), so statistically there’s no evidence against a unit root. The series keeps a lot of memory, but that also means its average and variance are unstable (Mean ≈ 28,509; SD ≈ 8,321), which makes it hard to use directly in models that assume stationarity.
When we transform the data into log returns (Figure 2 and Table 1.b), the behavior changes completely. The series now fluctuates around a constant level near zero, and the ACF drops to almost nothing right away, which basically shows that the transformation wiped out the long-term dependence.
The ADF test strongly confirms stationarity (p-value = 0.0000). The downside is that the distribution becomes much more heavy-tailed (kurtosis = 5.5382), meaning we get more extreme values than what a normal distribution would predict. So although the log-return series is easy to handle statistically, it loses the historical patterns that might have some predictive value.
The fractionally differenced series (with d = 0.60), shown in Figure 3 and Table 1.c, sits somewhere between these two extremes. Visually it looks stationary, but unlike the regular returns it still shows some structure; it is not just noise bouncing around zero.
The ADF test supports stationarity (p-value = 0.0174). What really stands out is that the ACF keeps some meaningful correlation at the early lags, which means the long-term memory hasn’t been completely destroyed. At the same time, the transformation removes the non-stationary drift we saw in the raw prices. Its kurtosis (1.1414) is also lower than that of the log returns, so the distribution has fewer extreme jumps.
Overall, this version keeps the useful memory while still meeting stationarity requirements, which is consistent with what Prado (75) describes in his book.
Step 2¶
To work with a stationary series, we first converted the raw price levels $Y_t$ into log returns using $$r_t = \ln(Y_t / Y_{t-1}).$$ This transformation helps keep the mean and variance more stable, which is essential before fitting any model (we did this in Step 1). After that, we reshaped the return series into a supervised format using a sliding window: each sample uses the last 10 returns ($r_{t-1}, r_{t-2}, \dots, r_{t-10}$) to predict the current one, $r_t$. Because neural networks train faster when features are on similar scales, we standardized the inputs with Z-score normalization, $$z_t = \frac{r_t - \mu}{\sigma},$$ where $\mu$ and $\sigma$ are the sample mean and standard deviation of the returns. The model itself is a basic MLP. It takes the 10 lagged inputs, passes them through two hidden layers (64 and 32 units), both with ReLU activations, $$\text{ReLU}(z) = \max(0, z),$$ and then outputs a single value with a linear activation since we are predicting a continuous return.
For training, we used the Adam optimizer and minimized the MSE, $$MSE = \frac{1}{n} \sum_{t=1}^{n} (r_t - \hat{r}_t)^2.$$ At the end, we evaluated how well the model did using the RMSE, $RMSE = \sqrt{MSE}$, and the $R^2$ score.
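The sliding-window construction described above can be sketched on a toy return series (the numbers, and the reduced lag count of 3 instead of 10, are purely illustrative; the full version is `create_supervised_dataset` in the code below):

```python
# Toy sketch of the sliding-window supervised format: lags -> next return.
import numpy as np

returns = np.array([0.01, -0.02, 0.03, 0.00, 0.015, -0.01])  # illustrative values
n_lags = 3

# Each row of X holds the n_lags previous returns; y holds the current return.
X = np.array([returns[i - n_lags:i] for i in range(n_lags, len(returns))])
y = returns[n_lags:]

# Z-score normalization of the inputs (mean 0, std 1)
X_z = (X - X.mean()) / X.std()

print(X.shape, y.shape)  # (3, 3) (3,)
```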
The theoretical foundation behind the MLP is universal approximation: any unknown function $f$ can be approximated by a network $\tilde{f}$. The Universal Approximation Theorem, proved by Funahashi (1989), shows that a three-layer feedforward neural network (that is, one hidden layer) can approximate any continuous function to any desired level of accuracy. Suppose we have a compact set $K \subset \mathbb{R}^n$ and a continuous function $f : K \rightarrow \mathbb{R}$. If the activation function $\sigma : \mathbb{R} \to \mathbb{R}$ is non-constant, bounded, increasing, and continuous, then for any approximation error $\varepsilon > 0$ it is possible to build a network with some number $N$ of hidden neurons, weights $w_{ij} \in \mathbb{R}$ $(i = 1,\ldots,N;\ j = 1,\ldots,n)$, biases $b_i \in \mathbb{R}$, and output weights $c_i \in \mathbb{R}$, such that the network's output $$\tilde{f}(x_1,\ldots,x_n) = \sum_{i=1}^{N} c_i \, \sigma\!\left( \sum_{j=1}^{n} w_{ij} x_j + b_i \right)$$ gets arbitrarily close to $f$. Formally, $$\max_{x \in K} \left| f(x_1,\ldots,x_n) - \tilde{f}(x_1,\ldots,x_n) \right| < \varepsilon,$$ meaning $\tilde{f}$ approximates $f$ as closely as we want.
A later result by Leshno et al. (1993) extended this idea. They showed that a one-hidden-layer network $\tilde{f}$ whose activation function $\sigma$ is locally bounded and piecewise continuous can approximate any continuous function on a compact set $K \subset \mathbb{R}^n$, that is, any $f \in C(\mathbb{R}^n)$, if and only if $\sigma$ is not a polynomial.
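As a tiny concrete illustration of why a non-polynomial activation such as ReLU suffices: a hand-built network with one hidden layer of just two ReLU units represents the non-smooth continuous function $|x|$ exactly (a toy example, not the project's model):

```python
# |x| = relu(x) + relu(-x): a two-neuron, one-hidden-layer ReLU network
# representing a non-smooth continuous function exactly.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tiny_net(x):
    # hidden weights w = [1, -1], biases b = [0, 0], output weights c = [1, 1]
    return 1.0 * relu(1.0 * x) + 1.0 * relu(-1.0 * x)

xs = np.linspace(-2, 2, 9)
print(np.allclose(tiny_net(xs), np.abs(xs)))  # True
```

No polynomial activation could do this exactly, since a fixed-degree polynomial network is itself a polynomial while $|x|$ is not.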
In the code below, we implement these mathematical expressions to solve step 2.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
import yfinance as yf
from statsmodels.tsa.stattools import adfuller
import warnings
# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")
#
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (18, 6)
# ---------------------------------------------------------
# 1. DATA ACQUISITION & PREPARATION (Re-running Step 1 logic)
# ---------------------------------------------------------
ticker = "BTC-USD"
start_date = "2022-01-01"
end_date = "2023-12-31"
data = yf.download(ticker, start=start_date, end=end_date, progress=False, auto_adjust=True)
if isinstance(data.columns, pd.MultiIndex):
prices = data.xs('Close', level=0, axis=1)[ticker].dropna()
else:
prices = data['Close'].dropna()
# Transformations
# 1. Log Returns
log_returns = np.log(prices / prices.shift(1)).dropna()
# 2. Fractional Differencing
def get_weights_ffd(d, thres=1e-5, max_len=1000):
w, k = [1.0], 1
while True:
w_k = -w[-1] * (d - k + 1) / k
if abs(w_k) < thres or k >= max_len: break
w.append(w_k)
k += 1
return np.array(w[::-1])
def frac_diff_ffd(series, d, thres=1e-5):
w = get_weights_ffd(d, thres)
width = len(w) - 1
# Apply weights to series values
vals = series.values
res = np.full(len(vals), np.nan)
for i in range(width, len(vals)):
res[i] = np.dot(vals[i-width:i+1], w)
return pd.Series(res, index=series.index)
# Optimize d
best_d = 1.0
frac_diff_series = None
possible_ds = np.linspace(0.1, 0.9, 17)
for d_val in possible_ds:
temp_series = frac_diff_ffd(prices, d_val).dropna()
if len(temp_series) > 50:
p_val = adfuller(temp_series)[1]
if p_val < 0.05:
best_d = d_val
frac_diff_series = temp_series
break
if frac_diff_series is None:
frac_diff_series = log_returns # Fallback
best_d = 1.0
print(f"Data Prepared. Optimal d for FracDiff: {best_d:.2f}")
# ---------------------------------------------------------
# 2. MODELING FUNCTIONS
# ---------------------------------------------------------
def create_supervised_dataset(series, n_lags=10):
"""Creates lagged features (X) and target (y)"""
vals = series.values
X, y = [], []
for i in range(n_lags, len(vals)):
X.append(vals[i-n_lags:i])
y.append(vals[i])
return np.array(X), np.array(y)
def build_mlp_model(input_dim):
"""Constructs the MLP architecture"""
model = Sequential([
Dense(64, activation='relu', input_shape=(input_dim,)),
Dense(32, activation='relu'),
Dense(1, activation='linear')
])
model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')
return model
def train_and_evaluate(series, name, n_lags=10, epochs=100, batch_size=32):
"""Trains MLP and returns results + history for plotting"""
# 1. Data Prep
X, y = create_supervised_dataset(series, n_lags)
# Train/Test Split (Chronological)
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
test_dates = series.index[n_lags + split:]
# Scaling
scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)
scaler_y = StandardScaler()
y_train_scaled = scaler_y.fit_transform(y_train.reshape(-1, 1))
y_test_scaled = scaler_y.transform(y_test.reshape(-1, 1))
# 2. Model Training
model = build_mlp_model(n_lags)
es = EarlyStopping(monitor='val_loss', patience=15, restore_best_weights=True)
history = model.fit(
X_train_scaled, y_train_scaled,
validation_split=0.2,
epochs=epochs,
batch_size=batch_size,
callbacks=[es],
verbose=0
)
# 3. Prediction
pred_scaled = model.predict(X_test_scaled, verbose=0)
y_pred = scaler_y.inverse_transform(pred_scaled).flatten()
# 4. Metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
return {
"name": name,
"rmse": rmse,
"mae": mae,
"r2": r2,
"y_test": y_test,
"y_pred": y_pred,
"dates": test_dates,
"history": history.history
}
# ---------------------------------------------------------
# 3. EXECUTION & VISUALIZATION
# ---------------------------------------------------------
# Run experiments
results_a = train_and_evaluate(prices, "a) Levels (Non-Stationary)")
results_b = train_and_evaluate(log_returns, "b) Log Returns (Stationary)")
results_c = train_and_evaluate(frac_diff_series, f"c) Frac Diff (d={best_d:.2f})")
experiments = [results_a, results_b, results_c]
# Visualization Loop
for res in experiments:
fig, axes = plt.subplots(1, 3, figsize=(20, 5))
plt.suptitle(f"Model Performance: {res['name']}", fontsize=16, y=1.05, fontweight='bold')
# Plot 1: Observed vs Fitted (Time Series)
# We zoom in on the last 100 points for clarity if series is long
display_len = 150
ax1 = axes[0]
ax1.plot(res['dates'][-display_len:], res['y_test'][-display_len:], label='Observed', color='black', linewidth=1.5, alpha=0.7)
ax1.plot(res['dates'][-display_len:], res['y_pred'][-display_len:], label='Fitted (Pred)', color='red', linestyle='--', linewidth=1.5)
ax1.set_title("Figure 4: Observed vs Fitted (Last 150 Days)", fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Plot 2: Loss Function (Training vs Validation)
ax2 = axes[1]
loss = res['history']['loss']
val_loss = res['history']['val_loss']
ax2.plot(loss, label='Training Loss', color='blue')
ax2.plot(val_loss, label='Validation Loss', color='orange')
ax2.set_title("Figure 4.1: Learning Curve (MSE Loss)", fontsize=12, fontweight='bold')
ax2.set_xlabel("Epochs")
ax2.set_ylabel("Loss (MSE)")
ax2.legend()
ax2.grid(True, alpha=0.3)
# Plot 3: Scatter Plot (Actual vs Predicted) - Good for spotting bias
ax3 = axes[2]
ax3.scatter(res['y_test'], res['y_pred'], alpha=0.5, color='purple', edgecolors='w')
# Perfect prediction line
min_val = min(res['y_test'].min(), res['y_pred'].min())
max_val = max(res['y_test'].max(), res['y_pred'].max())
ax3.plot([min_val, max_val], [min_val, max_val], 'k--', lw=2, label='Perfect Fit')
ax3.set_title(f"Figure 4.2: Prediction Accuracy (R² = {res['r2']:.4f})", fontsize=12, fontweight='bold')
ax3.set_xlabel("Observed Values")
ax3.set_ylabel("Predicted Values")
ax3.legend()
ax3.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# ---------------------------------------------------------
# 4. SUMMARY STATISTICS TABLE
# ---------------------------------------------------------
summary_data = []
for res in experiments:
summary_data.append({
"Model": res['name'],
"RMSE": res['rmse'],
"MAE": res['mae'],
"R² Score": res['r2']
})
summary_df = pd.DataFrame(summary_data)
print("\n" + "-"*60)
print("STEP 2: PREDICTION PERFORMANCE SUMMARY")
print("-"*60)
print(summary_df.to_string(index=False, float_format="%.6f"))
print("-"*60)
Acquiring data for BTC-USD...
Data Prepared. Optimal d for FracDiff: 0.60
------------------------------------------------------------
STEP 2: PREDICTION PERFORMANCE SUMMARY
------------------------------------------------------------
Model RMSE MAE R² Score
a) Levels (Non-Stationary) 881.596759 627.592312 0.981129
b) Log Returns (Stationary) 0.022090 0.014657 -0.073436
c) Frac Diff (d=0.60) 1132.062242 847.689191 -0.234766
------------------------------------------------------------
d) Prediction performance of the three models¶
When comparing the MLP models, the one using raw price levels appears to do extremely well: in Figure 4 (left panel), the fitted values follow the actual prices almost perfectly, and the scatter plot in Figure 4.2 (right panel) shows an almost straight line with an $R^2$ of 0.9811.
With log returns, we get a fully stationary series, but we also lose almost all the useful signal. In Figure 4 for this model, the fitted curve is practically flat, and in Figure 4.2 (middle row of plots) the points form a horizontal cloud with an $R^2 = -0.0734$.
The fractionally differenced series (bottom row of plots, with $d = 0.60$) sits theoretically between the two extremes, but in practice it did not behave well here. The learning curve in Figure 4.1 shows the training loss going down while the validation loss goes up, a classic overfitting pattern. That mismatch explains the weak fit in Figure 4 and the negative $R^2$ of -0.2348 in Figure 4.2. Even though fractional differencing preserved some of the long-term dependence, the model still failed to generalize on the test set.
These results were expected. Once the series is made stationary, the model must minimize a loss surface with many local minima, which makes optimization difficult; with non-stationary levels, by contrast, there is a clear trend pattern, and a model that avoids getting stuck in a local minimum will fit it well. It is also worth noting that the high $R^2$ on levels largely reflects the strong persistence of prices (a prediction close to yesterday's price already looks accurate), not necessarily genuine forecasting skill.
Step 3¶
To solve this step we used the same data as in steps 1 and 2, namely Bitcoin's price movements. We processed daily closing prices $Y_t$ from 2017 through 2023 using distinct transformations to handle the inherent non-stationarity of financial data. While we used standard log returns, defined as $$r_t = \ln(Y_t / Y_{t-1}),$$ as a baseline, the more advanced preprocessing involved fractional differentiation to preserve long-term memory. This technique expands the difference operator $(1-B)^d$ using a binomial series, resulting in a transformed series, $$\tilde{Y}_t = \sum_{k=0}^{\infty} w_k Y_{t-k}.$$ The weights $w_k$ are determined iteratively by $$w_k = -w_{k-1} \frac{d - k + 1}{k},$$ where $d$ represents the fractional order (the script defaults to 0.4 before searching) and $B$ is the backshift operator (see Step 1 for more details).
Once the data was prepared, we converted the rolling time-series windows into 2D images using Gramian Angular Fields (GAF) to leverage the pattern recognition capabilities of Convolutional Neural Networks (CNNs) (Oates, Wang 2). We first normalized the series within each window to $\tilde{x}_i$ in the range $[-1, 1]$ and computed the temporal correlation matrix using the trigonometric identity $$G_{i,j} = \cos(\phi_i + \phi_j).$$ By substituting $\phi_i = \arccos(\tilde{x}_i)$, this expression expands to $$G_{i,j} = \tilde{x}_i \tilde{x}_j - \sqrt{1 - \tilde{x}_i^2} \sqrt{1 - \tilde{x}_j^2},$$ effectively encoding the time dependencies into a spatial structure.
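A minimal numerical check of the GAF identity above, on a hypothetical window already rescaled to $[-1, 1]$ (the values are illustrative, not taken from the data):

```python
# Verifies cos(phi_i + phi_j) = x_i x_j - sqrt(1 - x_i^2) sqrt(1 - x_j^2)
# for a toy normalized window; phi = arccos(x) lies in [0, pi], so sin(phi) >= 0.
import numpy as np

x = np.array([-0.8, -0.2, 0.4, 0.9])  # normalized window in [-1, 1]
phi = np.arccos(x)

G_angle = np.cos(phi[:, None] + phi[None, :])            # angular form
G_algebra = np.outer(x, x) - np.outer(np.sqrt(1 - x**2),  # algebraic form
                                      np.sqrt(1 - x**2))

print(np.allclose(G_angle, G_algebra))  # True
```

Each window of the series thus becomes a symmetric matrix that a CNN can treat as an image.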
These images were then fed into a CNN trained with the Huber Loss function (Huber 492–518), which was chosen for its robustness against the extreme outliers often found in cryptocurrency markets. This function behaves quadratically for small errors but linearly for large ones, defined as $$L_\delta(y, f(x)) = \begin{cases} \frac{1}{2}(y - f(x))^2 & \text{for } |y - f(x)| \le \delta \\ \delta |y - f(x)| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$. To conclude the analysis, we measured the model's performance using the Root Mean Squared Error, calculated as $$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},$$ which aggregates the differences between the predicted values $\hat{y}_i$ and the actual target prices $y_i$.
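The piecewise Huber definition can be sketched directly in NumPy (here $\delta = 1$ is an arbitrary illustrative threshold; the script itself uses the Keras `Huber` loss):

```python
# Minimal NumPy version of the piecewise Huber loss defined above.
import numpy as np

def huber(residual, delta=1.0):
    r = np.abs(residual)
    quad = 0.5 * residual**2              # quadratic regime, |r| <= delta
    lin = delta * r - 0.5 * delta**2      # linear regime, |r| > delta
    return np.where(r <= delta, quad, lin)

print(huber(np.array([0.5, 3.0])))  # [0.125 2.5]: quadratic for small errors, linear for large
```

The linear regime is what damps the influence of extreme crypto price jumps on the gradient.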
The code below implements this.
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from statsmodels.tsa.stattools import adfuller
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, MaxPooling2D, Dropout, BatchNormalization, Input, LeakyReLU, GlobalAveragePooling2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.losses import Huber
from tensorflow.keras import regularizers
# ---------------------------------------------------------
# GLOBAL CONFIGURATION
# ---------------------------------------------------------
import warnings
warnings.filterwarnings("ignore")
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (20, 6)
# Set seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)
# HYPERPARAMETERS OPTIMIZED
WINDOW_SIZE = 48
BATCH_SIZE = 32
EPOCHS = 100
LEARNING_RATE = 0.0005
# ---------------------------------------------------------
# 1. DATA LOADING & ADVANCED PREPARATION
# ---------------------------------------------------------
ticker = "BTC-USD"
data = yf.download(ticker, start="2017-01-01", end="2023-12-31", progress=False, auto_adjust=True)
if isinstance(data.columns, pd.MultiIndex):
    prices = data.xs('Close', level=0, axis=1)[ticker].dropna()
else:
    prices = data['Close'].dropna()
# A. Log Returns (Stationary)
log_returns = np.log(prices / prices.shift(1)).dropna()
# B. Optimized Fractional Differencing (Find best 'd')
def get_weights_ffd(d, thres=1e-5, limit=1000):
    w, k = [1.0], 1
    while True:
        w_k = -w[-1] * (d - k + 1) / k
        if abs(w_k) < thres or k >= limit:
            break
        w.append(w_k)
        k += 1
    return np.array(w[::-1])
def frac_diff_ffd(series, d, thres=1e-5):
    # Forward fill to handle tiny gaps before calculation
    series = series.ffill()
    w = get_weights_ffd(d, thres)
    width = len(w) - 1
    if len(series) <= width:
        return pd.Series(dtype=float)
    return series.rolling(window=width + 1).apply(lambda x: np.dot(x, w), raw=True).dropna()
best_d = 0.4 # Default
min_p_value = 1.0
# Search for the minimum 'd' that makes the series stationary (p-value < 0.05)
for d_val in np.linspace(0.1, 0.9, 9):
    temp_diff = frac_diff_ffd(prices, d_val)
    if len(temp_diff) > 100:
        p_val = adfuller(temp_diff.dropna())[1]
        if p_val < 0.05:
            best_d = d_val
            print(f"Best d found: {best_d:.2f} (ADF p-value: {p_val:.4f})")
            break
frac_diff_series = frac_diff_ffd(prices, best_d)
# Alignment
common_index = prices.index.intersection(log_returns.index).intersection(frac_diff_series.index)
prices = prices.loc[common_index]
log_returns = log_returns.loc[common_index]
frac_diff_series = frac_diff_series.loc[common_index]
# ---------------------------------------------------------
# 2. GAF
# ---------------------------------------------------------
def series_to_gaf(window_data):
    # Local min-max scaling per window to [-1, 1]; this captures the *shape*
    # of the trend regardless of the absolute price level
    min_val = np.min(window_data)
    max_val = np.max(window_data)
    if max_val == min_val:
        x_scaled = np.zeros_like(window_data)
    else:
        x_scaled = ((window_data - min_val) / (max_val - min_val)) * 2 - 1
    x_cos = x_scaled
    x_sin = np.sqrt(1 - np.clip(x_scaled**2, 0, 1))
    # Gramian Angular Summation Field
    gaf = np.outer(x_cos, x_cos) - np.outer(x_sin, x_sin)
    return gaf
def create_dataset(series, window_size, flatten=False):
    vals = series.values
    X, y = [], []
    for i in range(window_size, len(vals)):
        window = vals[i-window_size:i]
        target = vals[i]
        img = series_to_gaf(window)
        if flatten:
            X.append(img.flatten())
        else:
            X.append(img)  # shape: (window, window)
        y.append(target)
    X = np.array(X)
    y = np.array(y)
    if not flatten:
        X = X.reshape(-1, window_size, window_size, 1)
    return X, y
# ---------------------------------------------------------
# 3. HIGH-PERFORMANCE MODEL ARCHITECTURES
# ---------------------------------------------------------
def build_cnn_optimized(input_shape):
    """
    Improved CNN with LeakyReLU, L2 regularization, and residual-like depth
    to capture complex GAF patterns.
    """
    model = Sequential([
        Input(shape=input_shape),
        # Block 1: Feature Extraction
        Conv2D(32, (3, 3), padding='same', kernel_regularizer=regularizers.l2(0.001)),
        LeakyReLU(alpha=0.1),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        # Block 2: Abstract Features
        Conv2D(64, (3, 3), padding='same', kernel_regularizer=regularizers.l2(0.001)),
        LeakyReLU(alpha=0.1),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        # Block 3: Deep Features
        Conv2D(128, (3, 3), padding='same'),
        LeakyReLU(alpha=0.1),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Flatten(),
        # Dense Layers
        Dense(128),
        LeakyReLU(alpha=0.1),
        Dropout(0.4),  # Robust dropout
        Dense(64),
        LeakyReLU(alpha=0.1),
        Dropout(0.3),
        Dense(1, activation='linear')
    ])
    # Huber loss is robust to outliers (spikes in Bitcoin)
    model.compile(optimizer=Adam(learning_rate=LEARNING_RATE), loss=Huber(delta=1.0))
    return model
def build_mlp_optimized(input_dim):
    """Deep MLP for flattened inputs"""
    model = Sequential([
        Input(shape=(input_dim,)),
        Dense(512),
        LeakyReLU(alpha=0.1),
        BatchNormalization(),
        Dropout(0.5),
        Dense(256),
        LeakyReLU(alpha=0.1),
        BatchNormalization(),
        Dropout(0.5),
        Dense(64, activation='relu'),
        Dense(1, activation='linear')
    ])
    model.compile(optimizer=Adam(learning_rate=LEARNING_RATE), loss='mse')
    return model
# ---------------------------------------------------------
# 4. TRAINING LOOP
# ---------------------------------------------------------
configs = [
    (prices, "a) Levels (CNN)", "CNN"),
    (log_returns, "b) Log Returns (MLP)", "MLP"),
    (frac_diff_series, f"c) FracDiff (d={best_d:.2f}) (CNN)", "CNN")
]
results = []
print("\n--- STARTING OPTIMIZED TRAINING ---")
for series_data, name, model_type in configs:
    print(f"\n>> Training: {name}")
    # 1. Dataset generation
    is_mlp = (model_type == "MLP")
    X, y = create_dataset(series_data, WINDOW_SIZE, flatten=is_mlp)
    # Chronological train/test split
    split_idx = int(len(X) * 0.8)
    X_train, X_test = X[:split_idx], X[split_idx:]
    y_train, y_test = y[:split_idx], y[split_idx:]
    dates_test = series_data.index[WINDOW_SIZE + split_idx:]
    # 2. TARGET SCALING (critical fix for Model A)
    # MinMaxScaler bounds the level targets to [0, 1], which is easier for the NN to predict
    if "Levels" in name:
        scaler_y = MinMaxScaler(feature_range=(0, 1))
    else:
        scaler_y = StandardScaler()
    y_train_s = scaler_y.fit_transform(y_train.reshape(-1, 1))
    y_test_s = scaler_y.transform(y_test.reshape(-1, 1))
    # 3. Model
    if is_mlp:
        model = build_mlp_optimized(WINDOW_SIZE * WINDOW_SIZE)
    else:
        model = build_cnn_optimized((WINDOW_SIZE, WINDOW_SIZE, 1))
    # 4. Training callbacks
    callbacks = [
        EarlyStopping(monitor='val_loss', patience=15, restore_best_weights=True, verbose=0),
        ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=1e-6, verbose=0)
    ]
    # 5. Fit
    history = model.fit(
        X_train, y_train_s,
        validation_split=0.2,
        epochs=EPOCHS,
        batch_size=BATCH_SIZE,
        callbacks=callbacks,
        verbose=0
    )
    # 6. Predict and invert the target scaling
    pred_s = model.predict(X_test, verbose=0)
    pred = scaler_y.inverse_transform(pred_s).flatten()
    # Metrics
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)
    # Save for plotting
    img_sample = X_test[-1].reshape(WINDOW_SIZE, WINDOW_SIZE)
    results.append({
        "Model": name,
        "RMSE": rmse,
        "R2": r2,
        "y_true": y_test,
        "y_pred": pred,
        "dates": dates_test,
        "hist": history.history,
        "img": img_sample
    })
# ---------------------------------------------------------
# 5. VISUALIZATION (Step 3d)
# ---------------------------------------------------------
for res in results:
    fig, axes = plt.subplots(1, 3, figsize=(24, 6))
    # Plot 1: prediction vs actual (zoomed in on the last 200 points for detail)
    ax1 = axes[0]
    zoom = 200
    dates = res['dates'][-zoom:]
    real = res['y_true'][-zoom:]
    pred = res['y_pred'][-zoom:]
    ax1.plot(dates, real, 'k-', linewidth=1.5, label='Actual')
    ax1.plot(dates, pred, 'r--', linewidth=1.5, label='Predicted')
    ax1.set_title(f"{res['Model']}", fontsize=14, fontweight='bold')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.tick_params(axis='x', rotation=45)
    # Plot 2: loss curves
    ax2 = axes[1]
    hist = res['hist']
    ax2.plot(hist['loss'], label='Train Loss', color='blue')
    ax2.plot(hist['val_loss'], label='Val Loss', color='orange')
    ax2.set_title("Training Convergence (Huber Loss)", fontsize=14, fontweight='bold')
    ax2.set_xlabel("Epochs")
    ax2.legend()
    ax2.grid(True, linestyle='--')
    # Plot 3: GAF representation of the last test window
    ax3 = axes[2]
    im = ax3.imshow(res['img'], cmap='Spectral', origin='lower', vmin=-1, vmax=1)
    ax3.set_title("GAF Input (Correlation Structure)", fontsize=14, fontweight='bold')
    fig.colorbar(im, ax=ax3, fraction=0.046, pad=0.04)
    ax3.axis('off')
    plt.tight_layout()
    plt.show()
# Summary Table
summary = pd.DataFrame([{k: v for k, v in r.items() if k in ['Model', 'RMSE']} for r in results])
print(summary.to_string(index=False))
Best d found: 0.50 (ADF p-value: 0.0155)

--- STARTING OPTIMIZED TRAINING ---

>> Training: a) Levels (CNN)
>> Training: b) Log Returns (MLP)
>> Training: c) FracDiff (d=0.50) (CNN)
Model RMSE
a) Levels (CNN) 16083.763002
b) Log Returns (MLP) 0.029398
c) FracDiff (d=0.50) (CNN) 1467.682251
d) Prediction performance of the three models¶
When we look at the model trained on the raw price levels (figure below), we see that it falls into a persistence trap. In the prediction plot, the CNN does not really predict anything; it simply repeats the last value it saw, which is why the red line sits almost perfectly on top of the black one, shifted slightly forward. This one-step lag also explains the huge RMSE of roughly 16,084.
The convergence plot backs this up: the validation loss jumps around and stays consistently higher than the training loss, a sign that the model is not learning anything useful beyond memorizing the previous value. The GAF image is mostly a smooth gradient with very little contrast; because raw prices are highly non-stationary and strongly autocorrelated, the image ends up with almost no structure for the CNN to latch onto.
The model built on log returns (shown below) behaves completely differently. Once the series is made stationary, the prediction plot flattens around zero. The model manages to follow the general level of volatility, but it cannot really capture the large jumps, which is expected, since log returns often look like white noise. Here, at least, the convergence plots look healthy: the training and validation losses drop together, suggesting the MLP learned the basic statistical properties of the series without overfitting. The GAF image looks like random noise because, by fully removing the trend and most of the memory, the transformation also destroys the spatial structure the image is supposed to capture.
The fractionally differenced model (shown below) sits between the first two approaches. In the prediction plot, it follows the overall trend better than the log-returns model and avoids the heavy lag we saw with the raw levels. It still misses some of the sharper turning points, but overall it behaves more sensibly. The convergence curve, however, starts off rough: the large early spike in validation loss suggests the model was unstable at first, possibly due to exploding gradients, before it eventually settled down. The GAF image shows a much richer, more complex pattern. This makes sense, since fractional differencing keeps part of the long-term memory while still making the series stationary.
Step 4: Discussion of the different results obtained between the CNN and MLP architectures¶
The differences in our results can be explained as follows:
The MLP treats the time lags as a flat vector of independent features, which works well enough for simple regression but struggles when the underlying distribution shifts, as we saw with the non-stationary raw levels. It essentially overfits the specific price range it trained on and fails when the market moves. The CNN, by contrast, hunts for spatial dependencies, edges, textures, and shapes within the GAF representation. This explains why it failed so badly on the raw levels: because the prices are so highly autocorrelated, the resulting GAF image is just a smooth, featureless gradient. There were no "edges" for the kernels to detect, so the model simply converged to a persistence baseline, repeating the last known value.
That is why the fractionally differenced data was the most interesting case. Unlike the log returns, which looked like static noise to the CNN, fractional differencing preserved enough memory to create a textured, checkerboard-like pattern in the GAF plots. We think this is the only scenario where the CNN has a genuine theoretical edge, because it can spatially map the complex correlation structures that the MLP misses by flattening everything. Essentially, the MLP hits a ceiling because it is too simple for the complex dynamics, while the CNN is more powerful but needs the input "image" to have contrast and structure, which we only really achieved with fractional differentiation.
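The contrast argument can be illustrated with a small standalone sketch on toy series (not our BTC data): a smooth trending window yields a low-contrast GAF image, while a noisy window yields a textured one. The `contrast` helper below is a crude proxy we introduce purely for illustration (mean absolute difference between neighboring pixels).

```python
import numpy as np

def gaf_image(window):
    # Min-max scale to [-1, 1], then build the Gramian Angular Summation Field
    x = (window - window.min()) / (window.max() - window.min()) * 2 - 1
    s = np.sqrt(1 - np.clip(x**2, 0, 1))
    return np.outer(x, x) - np.outer(s, s)

def contrast(img):
    # Hypothetical "texture" proxy: mean absolute difference between
    # vertically adjacent pixels of the GAF image
    return np.mean(np.abs(np.diff(img, axis=0)))

rng = np.random.default_rng(0)
trend = np.linspace(0, 1, 48)    # stand-in for smooth, trending prices
noise = rng.standard_normal(48)  # stand-in for white-noise-like returns

print(f"trend contrast: {contrast(gaf_image(trend)):.4f}")
print(f"noise contrast: {contrast(gaf_image(noise)):.4f}")
```

On these toy inputs the noisy window produces a markedly higher contrast value, matching the intuition that the CNN's kernels have edges to detect only when the series retains irregular structure.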
Conclusion¶
Over the course of these three steps, we learned how different transformations of a time series can completely change both the statistical properties of the data and the way machine-learning models behave. The raw BTC price levels are highly non-stationary and full of long trends, which makes them statistically messy but curiously easy for models to predict, since the models mostly just follow the drift. Once we convert the data to log returns, it becomes nicely stationary, but at the cost of losing almost all signal, so models end up predicting noise. Fractional differencing gave us a useful middle ground: it stabilized the series while keeping some memory.
When we train MLPs and CNNs on each of these versions, the results reflect this trade-off: models trained on raw prices look good on paper but are really just copying yesterday's value, while models trained on log returns are stable but have nothing meaningful to learn. The fractionally differenced version behaved consistently better, as the models capture more structure than with returns while avoiding the misleading drift in the raw levels.
The project demonstrates a more general point: how we transform the data is just as important as the model we use. Good predictive performance comes from balancing the removal of non-stationarity with the retention of enough structure for a model to learn from.
References¶
Granger, Clive W. J., and Roselyne Joyeux. "An Introduction to Long-Memory Time Series Models and Fractional Differencing." Journal of Time Series Analysis, vol. 1, no. 1, 1980, pp. 15–29.
Hosking, J. R. M. "Fractional Differencing." Biometrika, vol. 68, no. 1, 1981, pp. 165–176.
López de Prado, Marcos. Advances in Financial Machine Learning. John Wiley & Sons, 2018.
Leshno, Moshe, et al. “Multilayer Feedforward Networks with a Non-Polynomial Activation Function Can Approximate Any Function.” Neural Networks, vol. 6, no. 6, 1993, pp. 861–867.
Funahashi, Ken-Ichi. “On the Approximate Realization of Continuous Mappings by Neural Networks.” Neural Networks, vol. 2, no. 3, 1989, pp. 183–192.
Huber, Peter J. “Robust Estimation of a Location Parameter.” Breakthroughs in Statistics: Methodology and Distribution, Springer, 1992, pp. 492–518.
Wang, Zhiguang, and Tim Oates. “Imaging Time-Series to Improve Classification and Imputation.” arXiv, 2015, arXiv:1506.00327.