Financial Econometrics Group Work Project 2¶

Introduction¶

Time series models serve as pivotal tools for forecasting, risk assessment, and strategic decision-making. However, their efficacy is often compromised by inherent challenges such as abrupt structural shifts, interdependencies among predictors, and the computational burden of high-dimensional data. This work addresses three critical obstacles: detecting regime changes (Section 1), handling multicollinearity (Section 2), and extracting features (Section 3).

Regime changes, sudden shifts in data dynamics, can invalidate assumptions of stationarity, demanding robust detection mechanisms such as Markov switching models or structural break tests. Multicollinearity, where predictors exhibit linear dependencies, obscures interpretability and inflates variance, requiring techniques such as variance inflation factor (VIF) analysis or dimensionality reduction. Feature extraction, meanwhile, streamlines complex datasets by retaining informative patterns, balancing efficiency with fidelity to the original data. Through definitions, diagnostic frameworks, and visualizations, we demonstrate practical solutions to these issues.

Regime change in time series modeling¶

A regime change in time series modeling refers to an abrupt structural shift in the statistical properties of a process, such as its mean, variance, or autocorrelation, due to transitions between latent states (unobserved regimes) or deterministic structural breaks (Hamilton 357; Tsay 148). These shifts often arise from economic, political, or market dynamics, such as financial crises, policy changes, or technological disruptions. Two primary frameworks model regime changes:

(1) Structural Break Models: Structural breaks occur at specific points $\tau$, where the data-generating process abruptly changes. For example, an autoregressive process transitions between regimes as:
$$ y_t = \begin{cases} \mu_1 + \phi_1 y_{t-1} + \epsilon_t & t \leq \tau, \\ \mu_2 + \phi_2 y_{t-1} + \epsilon_t & t > \tau, \end{cases} \quad \epsilon_t \sim \mathcal{N}(0, \sigma^2), $$
where $\mu_1 \neq \mu_2$ or $\phi_1 \neq \phi_2$. The breakpoint $\tau$ is identified using statistical tests like the Chow test or Bai-Perron test (Tsay 148).
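As an illustration, the piecewise AR(1) above can be simulated and the break date recovered by a Chow-style grid search that splits the sample at each candidate $\tau$ and minimizes the combined residual sum of squares. This is a minimal sketch on synthetic data, not the full Bai-Perron procedure; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) with a structural break at tau = 150:
# the (mean, persistence) pair shifts from (mu1, phi1) to (mu2, phi2)
T, tau = 300, 150
mu1, phi1, mu2, phi2, sigma = 0.0, 0.5, 2.0, 0.8, 1.0
y = np.zeros(T)
for t in range(1, T):
    mu, phi = (mu1, phi1) if t <= tau else (mu2, phi2)
    y[t] = mu + phi * y[t - 1] + sigma * rng.standard_normal()

def sse_ar1(seg):
    """Residual sum of squares of an OLS AR(1) fit on one segment."""
    X = np.column_stack([np.ones(len(seg) - 1), seg[:-1]])
    beta, *_ = np.linalg.lstsq(X, seg[1:], rcond=None)
    resid = seg[1:] - X @ beta
    return resid @ resid

# Chow-style grid search: the estimated break minimizes the combined SSE
candidates = range(30, T - 30)  # trim the edges so both segments are estimable
tau_hat = min(candidates, key=lambda c: sse_ar1(y[:c]) + sse_ar1(y[c:]))
print(f"true break: {tau}, estimated break: {tau_hat}")
```

In practice the Bai-Perron procedure extends this idea to multiple unknown breaks, with formal tests for their number.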

(2) Latent State Transition Models: Latent states govern regime shifts through an unobserved variable $S_t \in \{1, 2, \dots, K\}$, evolving via a first-order Markov chain. The Markov Switching Autoregressive (MS-AR) model is:
$$ y_t = \mu_{S_t} + \sum_{i=1}^p \phi_{i,S_t} (y_{t-i} - \mu_{S_{t-i}}) + \sigma_{S_t} \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, 1), $$
with transitions defined by a matrix $\mathbf{P} = [p_{ij}]$:
$$ p_{ij} = \mathbb{P}(S_t = j \mid S_{t-1} = i), \quad \sum_{j=1}^K p_{ij} = 1 \quad \text{(Hamilton 357)}. $$
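The transition matrix $\mathbf{P}$ directly implies the long-run fraction of time spent in each regime (the stationary distribution) and the expected regime duration $1/(1 - p_{ii})$. A small sketch with an illustrative two-state matrix:

```python
import numpy as np

# Illustrative 2-state transition matrix, P[i, j] = P(S_t = j | S_{t-1} = i)
P = np.array([[0.95, 0.05],
              [0.10, 0.90]])

# Stationary distribution pi solves pi = pi P
# (left eigenvector of P for eigenvalue 1, normalized to sum to 1)
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
pi = pi / pi.sum()

# Expected duration of regime i is 1 / (1 - p_ii) (geometric sojourn time)
durations = 1.0 / (1.0 - np.diag(P))

print("stationary distribution:", pi)    # approx [0.667, 0.333]
print("expected durations:", durations)  # approx [20, 10] periods
```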

Detection Methods¶

(1) Hidden Markov Models (HMMs): The posterior probability of regime $k$ at time $t$ is computed via the forward-backward algorithm:
$$ \mathbb{P}(S_t = k \mid y_{1:T}) = \frac{\alpha_t(k) \beta_t(k)}{\sum_{j=1}^K \alpha_t(j) \beta_t(j)}, $$
where $\alpha_t(k) = \mathbb{P}(y_{1:t}, S_t = k)$ (forward probability) and $\beta_t(k) = \mathbb{P}(y_{t+1:T} \mid S_t = k)$ (backward probability) (Chen and Tsang 7).
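A minimal sketch of these forward-backward recursions on a simulated two-state Gaussian HMM, with rescaling at each step to avoid numerical underflow (the parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

# Toy 2-state Gaussian HMM
P = np.array([[0.9, 0.1], [0.2, 0.8]])   # transition matrix
mu, sigma = np.array([0.0, 3.0]), 1.0    # state-dependent means, common scale
pi0 = np.array([0.5, 0.5])               # initial state distribution

rng = np.random.default_rng(1)
states = [0]
for _ in range(199):
    states.append(rng.choice(2, p=P[states[-1]]))
states = np.array(states)
y = mu[states] + sigma * rng.standard_normal(200)

B = norm.pdf(y[:, None], loc=mu, scale=sigma)  # emission densities b_k(y_t)

# Forward pass: alpha_t(k) proportional to P(y_{1:t}, S_t = k)
alpha = np.zeros((len(y), 2))
alpha[0] = pi0 * B[0]
alpha[0] /= alpha[0].sum()
for t in range(1, len(y)):
    alpha[t] = (alpha[t - 1] @ P) * B[t]
    alpha[t] /= alpha[t].sum()            # rescale for numerical stability

# Backward pass: beta_t(k) proportional to P(y_{t+1:T} | S_t = k)
beta = np.ones((len(y), 2))
for t in range(len(y) - 2, -1, -1):
    beta[t] = P @ (B[t + 1] * beta[t + 1])
    beta[t] /= beta[t].sum()

# Posterior (smoothed) regime probabilities
post = alpha * beta
post /= post.sum(axis=1, keepdims=True)
accuracy = np.mean((post[:, 1] > 0.5) == (states == 1))
print(f"smoothed-state accuracy vs. true states: {accuracy:.2f}")
```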

(2) Hybrid Approaches: Bayesian methods automate regime identification by imposing hierarchical priors on parameters and transition probabilities, enabling flexible modeling of regime persistence and uncertainty (Sanquer et al. 4). These models are critical in finance and economics for capturing non-stationary behaviors (e.g., volatility clustering, fat tails) and improving forecasting accuracy during turbulent periods (Hamilton 362; Chen and Tsang 1).

Description¶

A regime change in time series modeling refers either to an abrupt structural break, which can be contiguous (occurring at a single point, dividing the series into adjacent segments) or non-contiguous (multiple disjoint breaks), or to a probabilistic shift in statistical properties (e.g., mean, variance).

Demonstration¶

For the demonstration and other parts requiring data, we used the U.S. unemployment rate (2000–2023) series sourced from FRED. We chose this series because unemployment rates are likely to exhibit abrupt structural breaks during economic crises, such as the 2008–2009 global financial crisis (triggered by housing market collapses and banking failures), followed by a low-unemployment regime during the 2010–2019 expansion marked by steady economic growth, and a sharp spike in 2020 due to the COVID-19 recession (driven by lockdowns and labor market disruptions).

In [ ]:
!pip install pandas_datareader statsmodels matplotlib --quiet

import os
import pandas as pd
import matplotlib.pyplot as plt
from pandas_datareader import data as web
from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression
import warnings
warnings.filterwarnings('ignore')
#== FRED API key (replace the placeholder with your own key)
os.environ["FRED_API_KEY"] = "YOUR_FRED_API_KEY"
df = web.DataReader("UNRATE", "fred", start="2000-01-01", end="2023-12-31")
df.head()
Out[ ]:
UNRATE
DATE
2000-01-01 4.0
2000-02-01 4.1
2000-03-01 4.0
2000-04-01 3.8
2000-05-01 4.0
In [ ]:
plt.figure(figsize=(12, 6))
plt.plot(df, label="U.S. Unemployment Rate (%)", color="navy")

plt.title("Figure 1: Monthly U.S. Unemployment Rate (2000–2023)")
plt.xlabel("Year")
plt.ylabel("Rate (%)")

#=== Convert date strings to datetime objects
crisis_2008 = pd.to_datetime('2008-01-01')
covid_2020 = pd.to_datetime('2020-03-01')
recovery_2021 = pd.to_datetime('2021-01-01')

#=== Add vertical lines with datetime objects
plt.axvline(x=crisis_2008, color='red', linestyle='--', linewidth=1.2)
plt.text(crisis_2008, 11, '->2008 Crisis', color='red', rotation=30, verticalalignment='center')

plt.axvline(x=covid_2020, color='black', linestyle='--', linewidth=1.2)
plt.text(covid_2020, 5, '->COVID-19 Shock', color='black', rotation=80, verticalalignment='center')

plt.axvline(x=recovery_2021, color='purple', linestyle='--', linewidth=1.2)
plt.text(recovery_2021, 13, '->Recovery Phase', color='purple', rotation=30, verticalalignment='center')

plt.legend()
plt.tight_layout()
plt.show()

Figure 1, the U.S. unemployment rate from 2000 to 2023, clearly exhibits three distinct regime changes marked by structural shifts in the labor market. The first shift occurs during the 2008 financial crisis, with a sharp and sustained rise in unemployment, followed by a gradual recovery. The second and most abrupt regime change is the COVID-19 shock in 2020, characterized by an unprecedented spike in unemployment, quickly transitioning into a recovery phase from 2021 onward, reflecting a rapid labor market rebound.

In [ ]:
# Fit 2-regime Markov switching model (switching intercept)
model = MarkovRegression(df, k_regimes=2, trend='c', order=1)
result = model.fit()

print(result.summary())
                        Markov Switching Model Results                        
==============================================================================
Dep. Variable:                 UNRATE   No. Observations:                  288
Model:               MarkovRegression   Log Likelihood                -491.836
Date:                Tue, 08 Apr 2025   AIC                            993.672
Time:                        22:25:32   BIC                           1011.987
Sample:                    01-01-2000   HQIC                          1001.012
                         - 12-01-2023                                         
Covariance Type:               approx                                         
                             Regime 0 parameters                              
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9206      0.109     45.092      0.000       4.707       5.135
                             Regime 1 parameters                              
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.1819      0.428     16.789      0.000       6.343       8.020
                           Non-switching parameters                           
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
sigma2         1.9514      0.548      3.562      0.000       0.878       3.025
                         Regime transition parameters                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
p[0->0]        0.9957      0.003    326.515      0.000       0.990       1.002
p[1->0]        0.0460      0.030      1.511      0.131      -0.014       0.106
==============================================================================

Warnings:
[1] Covariance matrix calculated using numerical (complex-step) differentiation.

Interpretation: The model identifies two regimes: Regime 0 (low unemployment, mean = 4.92%) and Regime 1 (high unemployment, mean = 7.18%). Regime 0 is highly persistent (99.6% chance of remaining), and Regime 1 is also persistent, with only a 4.6% monthly probability of transitioning back to Regime 0. The high variance (1.95) reflects volatility in unemployment rates during shocks. The log-likelihood (-491.8) and AIC (993.7) suggest a moderate fit, consistent with cyclical labor market dynamics. The results confirm structural breaks during crises (e.g., 2008, 2020).

Structural Break Detection (Bai-Perron), Hidden Markov Models (HMMs) and Bayesian Change-Point Detection¶

While approaches like Markov Switching Models require the number of regimes to be specified in advance (e.g., k_regimes=2), alternative methods allow the data to inform this choice more flexibly (Hamilton 98). Bayesian techniques such as change-point detection infer both the number and locations of structural shifts probabilistically, often through MCMC sampling (Barry and Hartigan 101). Non-parametric Bayesian models like Dirichlet Process Mixtures further relax assumptions by allowing for an unbounded number of regimes, adapting complexity to the data (Teh et al. 215). Other strategies, including Hidden Markov Models, estimate latent states and use model selection criteria such as AIC or BIC to identify the optimal number of states (Rabiner 267). Additionally, frequentist methods like structural break detection (e.g., Bai-Perron) identify statistically significant breakpoints by evaluating model fit across potential segmentations (Bai and Perron 72).

Structural Break Detection (Bai-Perron)¶

In [ ]:
!pip install ruptures --quiet

import ruptures as rpt
import matplotlib.pyplot as plt

#======= Structural Break Detection =======#
data = df.values

plt.figure(figsize=(12, 6))
plt.plot(df.index, data, label="U.S. Unemployment Rate (%)", color="navy", alpha=0.7)

#======= Detect breaks using the PELT algorithm (a computational stand-in for Bai-Perron) =======#
algo = rpt.Pelt(model="l2").fit(data)
breaks = algo.predict(pen=10)  # penalty parameter controlling the number of breaks

#======= Annotate breaks =======#
for brk in breaks[:-1]:  # Skip last break (end of series)
    break_date = df.index[brk]
    plt.axvline(x=break_date, color='red', linestyle='--', linewidth=1)
    plt.text(break_date,
             data.max() - 1,
             f'Break: {break_date.strftime("%Y-%m")}',
             color='red',
             rotation=90,
             verticalalignment='top')

plt.title("Figure 2: Structural Break Detection (Bai-Perron)")
plt.xlabel("Year")
plt.ylabel("Unemployment Rate (%)")
plt.legend()
plt.tight_layout()
plt.show()

Figure 2 displays structural breaks in the U.S. unemployment rate (2000–2023), identified here with the PELT algorithm as a computational stand-in for the Bai-Perron method. Three major shifts are observed: the 2008 financial crisis, the 2020 COVID-19 spike, and the 2021 recovery phase.

Hidden Markov Models¶

In [ ]:
!pip install hmmlearn --quiet
from hmmlearn import hmm
import numpy as np

#==Fit HMM with Gaussian emissions and auto-select states via AIC
best_aic = np.inf
best_n_states = 1

for n_components in [1, 2, 3]:
    model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag")
    model.fit(data.reshape(-1, 1))
    aic = model.aic(data.reshape(-1, 1))
    if aic < best_aic:
        best_aic = aic
        best_n_states = n_components

print(f"Optimal states: {best_n_states} (AIC={best_aic:.2f})")
Optimal states: 2 (AIC=839.32)

The output indicates that a Hidden Markov Model (HMM) with 2 hidden states is optimal for the data, as determined by the Akaike Information Criterion (AIC = 839.32).

Diagram¶

In [ ]:
# Fit 3-regime Markov switching model
model = MarkovRegression(
    df,
    k_regimes=3,
    trend='ct',
    order=1,
   # switching_variance=True
)
result = model.fit()

prob_high = result.smoothed_marginal_probabilities[1]

plt.figure(figsize=(12, 6))
plt.plot(df, label='Unemployment Rate (%)', color='navy', alpha=0.7)
plt.plot(prob_high * 10, label='High-Unemployment Probability (Scaled x10)', color='red', linestyle='--')

plt.fill_between(df.index, 0, prob_high*10, where=(prob_high > 0.5),
                 color='red', alpha=0.2, label='High-Unemployment Regime')

plt.title("Figure 3: Regime Detection in U.S. Unemployment Rate (2000–2023)")
plt.xlabel("Year")
plt.ylabel("Rate / Probability")
plt.legend()
plt.show()

Figure 3 overlays the smoothed probability of the high-unemployment regime (scaled for visibility) on the U.S. unemployment rate (2000–2023). The shaded intervals, where this probability exceeds 0.5, coincide with the major shifts identified earlier: the 2008 financial crisis and the 2020 COVID-19 shock.

Diagnostic¶

To validate the model’s assumptions and performance, we analyze residuals, autocorrelation, and regime probabilities.

In [ ]:
from statsmodels.graphics.tsaplots import plot_acf
from scipy.stats import probplot

result = model.fit()

residuals = result.resid
#== residuals and ACF
fig, axes = plt.subplots(1, 3, figsize=(16, 4))


axes[0].plot(df.index, residuals, color='navy')
axes[0].axhline(0, linestyle='--', color='gray')
axes[0].set_title('Figure 4.1: Residuals of MS-AR Model')
axes[0].set_ylabel('Residuals')

#==ACF of residuals
plot_acf(residuals.dropna(), lags=20, ax=axes[1], color='red')
axes[1].set_title('Figure 4.2: Autocorrelation Function (ACF)')

#==QQ plot for normality check
probplot(residuals.dropna(), dist='norm', plot=axes[2])
axes[2].set_title('Figure 4.3: QQ Plot')

plt.tight_layout()
plt.show()

The data exhibit problems of autocorrelation and heavy-tailed distributions, which may affect the robustness of the analysis.

Damage¶

The Markov Switching Model reveals critical challenges in detecting regime changes in unemployment data: (1) Residual autocorrelation reveals unmodeled temporal dynamics, suggesting incomplete capture of regime transitions; (2) Non-Gaussian residuals (heavy tails) violate model assumptions, distorting confidence intervals; (3) Non-stationarity from structural breaks (e.g., 2008, 2020) complicates equilibrium definitions, as regimes lack stable statistical properties post-shock. Additionally, (4) overfitting risks arise with excessive regimes or high AR orders, masking true economic signals with noise. (5) Parameter instability—shifts in transition probabilities over time—undermines long-term forecasting reliability. Feature extraction (e.g., volatility lags) and multicollinearity mitigation (e.g., PCA) enhance efficiency but require balancing interpretability. Addressing these issues refines regime detection, enabling robust crisis forecasting and policy design.

In [ ]:
import pandas as pd
from pandas_datareader import data as web

unrate = web.DataReader("UNRATE", "fred", start="2000-01-01", end="2023-12-31")  # Monthly

gdp = web.DataReader("GDPC1", "fred", start="2000-01-01", end="2023-12-31")      # Quarterly

gdp_yoy = gdp.pct_change(4) * 100
gdp_yoy.columns = ["GDP_YOY_GROWTH"]

gdp_monthly = gdp_yoy.resample("ME").ffill()

combined_df = pd.merge(
    unrate,
    gdp_monthly,
    left_index=True,
    right_index=True,
    how="outer"
)

combined_df["GDP_YOY_GROWTH"] = combined_df["GDP_YOY_GROWTH"].ffill()

combined_df = combined_df.dropna(subset=["UNRATE"])

combined_df = combined_df.dropna()

model = MarkovRegression(
    combined_df["UNRATE"],
    k_regimes=3,
    trend="ct",
    order=12,
    exog=combined_df["GDP_YOY_GROWTH"],
    #switching_variance=True
)
result = model.fit()

from statsmodels.tsa.stattools import adfuller

#==Test stationarity of unemployment rate
adf_unrate = adfuller(combined_df["UNRATE"].dropna())
print(f"ADF p-value (UNRATE): {adf_unrate[1]:.3f}")

#==Test stationarity of residuals
residuals = result.resid.dropna()
adf_resid = adfuller(residuals)
print(f"ADF p-value (Residuals): {adf_resid[1]:.3f}")

from statsmodels.stats.diagnostic import acorr_ljungbox

#==Performing Ljung-Box test on residuals
lb_test = acorr_ljungbox(residuals, lags=20)
print(lb_test)

from scipy.stats import probplot
Fig, axes = plt.subplots(1, 2, figsize=(16, 4))

probplot(result.resid.dropna(), dist="norm", plot=axes[0])
axes[0].set_title("Figure 5.1: QQ Plot of Residuals")

plot_acf(result.resid.dropna(), lags=20, ax=axes[1], title="Figure 5.2: ACF of Residuals")

plt.tight_layout()
plt.show()
ADF p-value (UNRATE): 0.049
ADF p-value (Residuals): 0.003
       lb_stat      lb_pvalue
1   176.404773   2.954250e-40
2   277.784164   4.785586e-61
3   370.725540   4.848473e-80
4   457.649831   9.638266e-98
5   517.829166  1.130935e-109
6   552.334363  4.431761e-116
7   572.264870  2.279826e-119
8   584.208884  5.802081e-121
9   592.355686  9.152336e-122
10  595.065528  2.008855e-121
11  595.238896  1.461266e-120
12  595.275701  1.081723e-119
13  596.168535  5.027513e-119
14  596.588201  2.830401e-118
15  596.892051  1.623914e-117
16  597.872278  6.464842e-117
17  598.993419  2.327846e-116
18  601.008694  5.267250e-116
19  603.942482  7.430332e-116
20  608.427211  4.825610e-116

We resolved the problem of heavy tails and outliers: the unemployment rate (UNRATE) is stationary at the 5% level (ADF p-value = 0.049), and the residuals are strongly stationary (ADF p-value = 0.003). However, the Ljung-Box test and the ACF of the residuals indicate severe autocorrelation at all tested lags (1–20), with p-values near zero (e.g., 2.95e-40 at lag 1). This means the model fails to capture temporal dependencies in the data, significantly undermining its reliability.

Directions¶

Detecting regime changes in time series requires balancing model flexibility and data integrity. When models exhibit poor fit (e.g., residual autocorrelation, parameter instability), strategic data manipulation such as outlier removal (e.g., winsorizing COVID-19 extremes), differencing to mitigate non-stationarity, or shortening horizons to isolate structural breaks can enhance accuracy. However, these adjustments risk discarding critical transitions (e.g., omitting 2008 data erases financial crisis dynamics) or inducing overfitting.

Rolling window analyses address time-varying parameters but sacrifice insight into long-term regime persistence. To resolve this, validate manipulations via incremental AIC/BIC comparisons and hybrid approaches (e.g., adding crisis-specific regimes instead of truncating data). Ultimately, prioritize adjustments grounded in economic logic, such as aligning regime definitions with NBER recessions, to preserve interpretability while refining forecasts.

In [ ]:
!pip install arch --quiet
import pandas as pd
from pandas_datareader import data as web
import matplotlib.pyplot as plt
from arch import arch_model
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.gofplots import qqplot
from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression

unrate = web.DataReader("UNRATE", "fred", start="2000-01-01", end="2023-12-31")
returns = unrate.pct_change().dropna() * 100  # Convert to percentage returns

# Fit GARCH(1,1) Model
garch_model = arch_model(returns, mean="Constant", vol="GARCH", p=1, q=1, dist="t")
garch_result = garch_model.fit(disp="off")

# Fit Markov Switching Model
ms_model = MarkovRegression(returns, k_regimes=2, trend='ct', switching_variance=True)
ms_result = ms_model.fit()

# Plot ACFs
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

plot_acf(garch_result.resid.dropna(), lags=20, ax=axes[0])
axes[0].set_title("Figure 6.1: ACF of GARCH Residuals")

plot_acf(ms_result.resid.dropna(), lags=20, ax=axes[1])
axes[1].set_title("Figure 6.2: ACF of Markov Switching Residuals")

plt.tight_layout()
plt.show()
In [ ]:
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.stats import norm


y = unrate.values.flatten()
T = len(y)

# -----------------------------
# Log-likelihood Function
# -----------------------------
def ms_garch_loglik(params):
    # Unpack parameters
    mu = params[0:2]
    omega = params[2:4]
    alpha = params[4:6]
    beta = params[6:8]
    p00 = params[8]
    p11 = params[9]

    # Transition probabilities
    P = np.array([
        [p00, 1 - p00],
        [1 - p11, p11]
    ])

    # Initialize regime probabilities and variances
    xi = np.ones(2) / 2  # Start with equal regime probability
    ll = 0.0
    h = np.array([np.var(y)] * 2)  # Initial volatility per regime

    for t in range(1, T):
        # Forecast next h for each regime
        for s in range(2):
            h[s] = omega[s] + alpha[s] * (y[t - 1] - mu[s])**2 + beta[s] * h[s]

        # Conditional densities
        f = np.array([
            norm.pdf(y[t], loc=mu[0], scale=np.sqrt(h[0])),
            norm.pdf(y[t], loc=mu[1], scale=np.sqrt(h[1]))
        ])

        # Prediction step
        xi_pred = P.T @ xi

        # Likelihood at t
        ft = np.dot(f, xi_pred)
        ll += np.log(ft + 1e-8)

        # Filtering step
        xi = (f * xi_pred) / (ft + 1e-8)

    return -ll  # Negative log-likelihood for minimization

# -----------------------------
# Initial values and bounds
# -----------------------------
initial_params = [0.0, 0.0,   # mu0, mu1
                  0.01, 0.01, # omega0, omega1
                  0.05, 0.05, # alpha0, alpha1
                  0.90, 0.90, # beta0, beta1
                  0.9, 0.9]   # p00, p11

bounds = [(-1, 1)] * 2 + [(1e-6, 2)] * 6 + [(0.01, 0.99)] * 2

# -----------------------------
# Estimate the model
# -----------------------------
result = minimize(ms_garch_loglik, initial_params, bounds=bounds, method='L-BFGS-B')
params_hat = result.x

# -----------------------------
# Print Results
# -----------------------------
regime_labels = ['Regime 0', 'Regime 1']
for i in range(2):
    print(f"\n--- {regime_labels[i]} ---")
    print(f"mu     = {params_hat[i]:.4f}")
    print(f"omega  = {params_hat[2+i]:.4f}")
    print(f"alpha  = {params_hat[4+i]:.4f}")
    print(f"beta   = {params_hat[6+i]:.4f}")
print(f"\nTransition Matrix:")
print(f"P = [[{params_hat[8]:.2f}, {1 - params_hat[8]:.2f}],")
print(f"     [{1 - params_hat[9]:.2f}, {params_hat[9]:.2f}]]")
--- Regime 0 ---
mu     = 1.0000
omega  = 1.1043
alpha  = 0.9757
beta   = 0.0000

--- Regime 1 ---
mu     = 1.0000
omega  = 1.1043
alpha  = 0.9757
beta   = 0.0000

Transition Matrix:
P = [[0.90, 0.10],
     [0.10, 0.90]]
In [ ]:
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.stats import norm
from statsmodels.graphics.tsaplots import plot_acf
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.stats.mstats import winsorize


y = unrate.values.flatten()
exog = combined_df["GDP_YOY_GROWTH"].values


min_len = min(len(y), len(exog))
y = y[:min_len]
exog = exog[:min_len]
T = min_len

# =====================
# MS-GARCH Model with Exogenous Variable
# =====================
def ms_garch_loglik_exog(params):
    mu = params[0:2]
    gamma = params[2:4]
    omega = params[4:6]
    alpha = params[6:8]
    beta = params[8:10]
    p00, p11 = params[10:12]

    P = np.array([[p00, 1 - p00], [1 - p11, p11]])
    xi = np.ones(2)/2
    ll = 0.0
    h = np.array([np.var(y)] * 2)

    for t in range(1, T):
        regime_mean = mu + gamma * exog[t]
        h = omega + alpha * (y[t-1] - regime_mean)**2 + beta * h
        f = norm.pdf(y[t], loc=regime_mean, scale=np.sqrt(h))
        xi_pred = P.T @ xi
        ft = f @ xi_pred
        ll += np.log(ft + 1e-12)
        xi = (f * xi_pred) / ft

    return -ll

# =====================
# Parameter Initialization with Exogenous Terms
# =====================
initial_params = [
    y.mean()/2, y.mean()/2,  # mu
    0.1, -0.1,               # gamma
    0.01, 0.01,              # omega
    0.05, 0.05,              # alpha
    0.85, 0.85,              # beta
    0.95, 0.95               # p00, p11
]

bounds = [
    (-2, 2), (-2, 2),         # mu
    (-1, 1), (-1, 1),         # gamma
    (1e-6, 1), (1e-6, 1),     # omega
    (0, 0.3), (0, 0.3),       # alpha
    (0.5, 0.99), (0.5, 0.99), # beta
    (0.7, 0.99), (0.7, 0.99)  # transition probs
]

# =====================
# Model Estimation
# =====================
result = minimize(ms_garch_loglik_exog, initial_params,
                 bounds=bounds, method='L-BFGS-B',
                 options={'maxiter': 1500})

params_hat = result.x

# =====================
#  Diagnostics
# =====================
xi = np.ones((T, 2)) * (1/2)
h_hist = np.zeros((T, 2))
h_hist[0] = [np.var(y)] * 2

for t in range(1, T):
    regime_mean = params_hat[0:2] + params_hat[2:4] * exog[t]
    h_hist[t] = (params_hat[4:6] +
                 params_hat[6:8] * (y[t-1] - regime_mean)**2 +
                 params_hat[8:10] * h_hist[t-1])

    f = norm.pdf(y[t], loc=regime_mean, scale=np.sqrt(h_hist[t]))
    xi_pred = np.array([[params_hat[10], 1 - params_hat[10]],
                        [1 - params_hat[11], params_hat[11]]]).T @ xi[t-1]
    xi[t] = (f * xi_pred) / (f @ xi_pred + 1e-12)

h_weighted = np.sum(h_hist * xi, axis=1)
residuals = y - np.sum(xi * (params_hat[0:2] + params_hat[2:4] * exog[:, None]), axis=1)
std_resid = residuals / np.sqrt(h_weighted)

fig, ax = plt.subplots(2, 1, figsize=(12, 8))

# Regime probabilities plot
ax[0].plot(unrate.index[:T], xi[:, 0], label='Low Volatility', alpha=0.7)
ax[0].plot(unrate.index[:T], xi[:, 1], label='High Volatility', alpha=0.7)
ax[0].set_title('Figure 7.1: Regime Probabilities with GDP Impact')
ax[0].legend()

# Conditional volatility with GDP overlay
ax1b = ax[1].twinx()
ax[1].plot(unrate.index[:T], h_weighted, color='black', label='Conditional Volatility')
ax1b.bar(unrate.index[:T], exog, alpha=0.3, color='orange', label='GDP Growth', width=20)
ax[1].set_title('Figure 7.2: Volatility vs. Exogenous GDP Growth')
ax[1].legend(loc='upper left')
ax1b.legend(loc='upper right')

plt.tight_layout()
plt.show()

Transforming the raw unemployment rate into returns resolved the autocorrelation and non-stationarity issues, making the series more suitable for time series modeling and improving the reliability of the volatility analysis. The MS-GARCH model with an exogenous variable (GDP YoY growth) further enhanced performance: including GDP in the mean equation allowed the model to capture macroeconomic influences on unemployment. Residual diagnostics showed no significant autocorrelation, confirming model adequacy, and the regime probabilities reflected meaningful economic shifts between high- and low-volatility periods.

Deployment¶

Regime change models offer a robust solution for analyzing time series marked by nonlinear dynamics and structural breaks, such as those driven by economic shocks, policy shifts, or environmental events. Unlike static linear models, these frameworks segment data into distinct regimes (e.g., recession/growth or high/low volatility) in which parameters such as the mean, variance, or autocorrelation adapt dynamically to reflect evolving conditions. Deployment begins with preprocessing (ensuring stationarity, handling outliers) and selecting a regime-switching mechanism: probabilistic transitions via Markov switching models or rule-based thresholds in TAR models. Parameters are estimated using maximum likelihood or Bayesian methods, often aided by the EM algorithm to infer latent regime states, while filtering techniques such as the Hamilton filter decode the most probable regime sequence. Validation involves rigorous residual diagnostics, regime separability tests, and model comparison via AIC/BIC to avoid overfitting. Forecasts integrate regime-specific dynamics with transition probabilities (e.g., Markov matrices) to quantify uncertainty, while scenario analysis simulates outcomes under hypothetical regime shifts, enhancing risk assessment. Applications span economics (tracking business cycles), finance (modeling volatility clustering), and climatology (predicting El Niño transitions). Despite their computational complexity, these models excel in real-time regime detection, enabling adaptive strategies in volatile environments.

Conclusion¶

In this challenge, we explored regime change in time series modeling by defining key concepts, demonstrating techniques, and applying them to U.S. unemployment data. We analyzed both structural break models and latent state approaches, highlighting their strengths and limitations. Key problems identified included heavy tails, residual autocorrelation, and non-stationarity. We addressed these by ensuring stationarity, mitigating outliers, and evaluating model fit via AIC/BIC. Our findings confirmed major regime shifts during economic crises (2008, 2020) and recovery periods.

References¶

Bai, Jushan, and Pierre Perron. “Estimating and Testing Linear Models with Multiple Structural Changes.” Econometrica, vol. 66, no. 1, 1998, pp. 47–78.

Barry, Daniel, and John A. Hartigan. “A Bayesian Analysis for Change Point Problems.” Journal of the American Statistical Association, vol. 88, no. 421, 1993, pp. 309–319.

Chen, Jun, and Edward P. K. Tsang. Detecting Regime Change in Computational Finance. 2023.

Hamilton, James D. Time Series Analysis. Princeton University Press, 1994.

Rabiner, Lawrence R. “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.” Proceedings of the IEEE, vol. 77, no. 2, 1989, pp. 257–286.

Sanquer, S., et al. “Hierarchical Bayesian Regime Identification.” Journal of Econometrics, 2021.

Teh, Yee Whye, et al. “Hierarchical Dirichlet Processes.” Journal of the American Statistical Association, vol. 101, no. 476, 2006, pp. 1566–1581.

Tsay, Ruey S. Analysis of Financial Time Series. 3rd ed., Wiley, 2010.

Feature Extraction¶

Definitions:¶

Feature extraction is the process of transforming raw data into informative and compact representations while retaining essential characteristics (Friedman et al. 253). In financial time series, common extracted features include moving averages, volatility, and momentum indicators.
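A minimal sketch of such features on a simulated daily price series (the 20-day and 10-day window lengths are illustrative choices):

```python
import numpy as np
import pandas as pd

# Simulated daily price series (geometric random walk)
rng = np.random.default_rng(4)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0.0005, 0.01, 500))))

features = pd.DataFrame({
    "return": prices.pct_change(),                         # daily return
    "roll_mean_20": prices.rolling(20).mean(),             # 20-day moving average
    "roll_vol_20": prices.pct_change().rolling(20).std(),  # 20-day rolling volatility
    "momentum_10": prices / prices.shift(10) - 1,          # 10-day momentum
}).dropna()

print(features.head())
print("feature matrix shape:", features.shape)
```

Dropping the warm-up rows (those lost to the longest rolling window) keeps the feature matrix aligned and free of missing values before modeling.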

Description:¶

This notebook explores key challenges in time series modeling using Apple (AAPL) stock data from Yahoo Finance. We will cover:

  1. Feature Extraction (Rolling Mean & Volatility)
  2. Handling Multicollinearity (Correlation Matrix, Heatmap)
  3. Detecting Regime Change (Markov Switching Model)
In [ ]:
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.vector_ar.vecm import coint_johansen
from scipy.stats import skew

1-2. Fetch Apple Stock Data¶

Definitions:¶

Non-Stationarity & Equilibrium:¶

Non-stationarity & Unit Root Tests: The work by David Dickey and Wayne Fuller (often cited as Dickey and Fuller, 1979, 1981) on testing for unit roots (a formal way to identify non-stationarity) was groundbreaking. Phillips and Perron, among others, also made significant contributions.

Cointegration & Equilibrium: Sir Clive Granger (Nobel Prize laureate) formalized the concept of cointegration in the 1980s, showing how non-stationary series could have a meaningful long-run equilibrium relationship. Robert Engle (also a Nobel Prize laureate) worked closely with Granger and developed the framework for testing and modeling cointegration, including the Engle-Granger two-step method.

Error Correction Models (ECMs): Engle and Granger demonstrated the link between cointegration and ECMs, showing that if variables are cointegrated, their relationship can be represented by an ECM which captures both the short-run dynamics and the long-run equilibrium adjustment. Soren Johansen later developed more advanced methods for testing cointegration in systems with multiple variables.

Description:¶

Task: Download AAPL stock data¶

  • Use yfinance to fetch historical data from 2020-2025.
  • Compute daily returns as percentage change in closing price.
  • Drop missing values.

Demonstration:¶

In [ ]:
# Download Apple stock data from Yahoo Finance
stock_data = yf.download("AAPL", start="2020-01-01", end="2025-01-01")
stock_data["Returns"] = stock_data["Close"].pct_change()
apple_data = stock_data["Returns"].dropna()

# Display first rows
stock_data.head()
YF.download() has changed argument auto_adjust default to True
[*********************100%***********************]  1 of 1 completed
Out[ ]:
Price Close High Low Open Volume Returns
Ticker AAPL AAPL AAPL AAPL AAPL
Date
2020-01-02 72.716057 72.776583 71.466797 71.721004 135480400 NaN
2020-01-03 72.009132 72.771760 71.783977 71.941343 146322800 -0.009722
2020-01-06 72.582901 72.621639 70.876068 71.127858 118387200 0.007968
2020-01-07 72.241539 72.849216 72.021223 72.592586 108872000 -0.004703
2020-01-08 73.403656 73.706287 71.943766 71.943766 132079200 0.016087

1-3. Feature Extraction: Rolling Mean & Volatility¶

Demonstration:¶

Feature Extraction Implementation¶

We extract relevant financial features (Friedman et al. 253):

Log Returns: $r_t = \ln(P_t / P_{t-1})$

Rolling Volatility: the standard deviation of returns over a moving window of length $w$, $\sigma_t = \operatorname{std}(r_{t-w+1}, \ldots, r_t)$

Simple Moving Average (SMA): $\mathrm{SMA}_t = \frac{1}{w} \sum_{i=0}^{w-1} P_{t-i}$

Description:¶

Task: Extract key features¶

  • Compute 30-day Rolling Mean to observe price trends.
  • Compute 30-day Volatility (rolling standard deviation).
  • Visualize both using a time series plot.

Diagram: Rolling mean and volatility¶

In [ ]:
# Feature extraction: Rolling mean and volatility
stock_data["Rolling_Mean"] = stock_data["Close"].rolling(window=30).mean()
stock_data["Volatility"] = stock_data["Close"].rolling(window=30).std()

# Plot rolling statistics
plt.figure(figsize=(12,6))
plt.plot(stock_data["Close"], label="Closing Price", color="blue")
plt.plot(stock_data["Rolling_Mean"], label="30-day Rolling Mean", color="red")
plt.legend()
plt.title("Feature Extraction: Rolling Mean")
plt.show()
[Figure: AAPL closing price with 30-day rolling mean]

Diagnostic:¶

This section assesses the characteristics of the data and the extracted features to identify potential modeling challenges.

The plot of the closing price and the 30-day rolling mean shows the smoothing effect of the moving average, highlighting the underlying trend in the AAPL stock price over the period.

A similar plot for volatility (standard deviation) would reveal periods of high and low price fluctuation.

Data Distribution:

Financial returns often exhibit non-normal characteristics like fat tails (leptokurtosis) and skewness. We should check these properties for the calculated apple_data (daily returns).

Stationarity:

Raw stock prices (like stock_data["Close"]) are typically non-stationary (their mean and variance change over time). The rolling mean, being derived from the price, is also likely non-stationary. Stock returns (apple_data) are often closer to stationary, but formal tests (like ADF) are required to confirm this. Non-stationarity violates assumptions of many standard time series models.

Volatility Clustering:

Visual inspection of returns or volatility plots often reveals periods in which large changes are followed by large changes, and small changes by small changes (volatility clustering). This suggests conditional heteroskedasticity, which may require models such as GARCH.
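Volatility clustering can be made concrete with a small simulation of a GARCH(1,1) process; the parameter values below are illustrative assumptions chosen to resemble typical daily equity returns.

```python
import numpy as np

# Simulate GARCH(1,1): sigma2_t = omega + alpha*eps_{t-1}^2 + beta*sigma2_{t-1}
rng = np.random.default_rng(1)
omega, alpha, beta = 0.05, 0.10, 0.85   # alpha + beta < 1 => stationary variance
n = 2000
eps = np.zeros(n)
sigma2 = np.zeros(n)
sigma2[0] = omega / (1 - alpha - beta)  # start at the unconditional variance
for t in range(1, n):
    sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    eps[t] = np.sqrt(sigma2[t]) * rng.normal()

# Clustering shows up as positive autocorrelation in SQUARED returns,
# even though the returns themselves are serially uncorrelated
sq = eps ** 2
acf1 = np.corrcoef(sq[:-1], sq[1:])[0, 1]
print(f"lag-1 autocorrelation of squared returns: {acf1:.3f}")
```

Fitting such a model to real returns would typically be done with a dedicated package rather than by hand; the simulation here only illustrates the clustering mechanism.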

In [ ]:
# Plot Volatility
plt.figure(figsize=(12, 6))
plt.plot(stock_data["Volatility"], label="30-day Rolling Volatility", color="purple")
plt.title("Feature Extraction: Rolling Volatility")
plt.xlabel("Date")
plt.ylabel("Standard Deviation of Closing Price")
plt.legend()
plt.show()

# Check stationarity of returns using Augmented Dickey-Fuller test
print("\n--- ADF Test on Daily Returns ---")
adf_result = adfuller(apple_data)
print(f'ADF Statistic: {adf_result[0]}')
print(f'p-value: {adf_result[1]}')
print('Critical Values:')
for key, value in adf_result[4].items():
    print(f'\t{key}: {value}')
# Interpretation: If p-value < 0.05, we reject the null hypothesis (H0) that the series has a unit root (is non-stationary).

# Check basic distribution stats for returns
print("\n--- Distribution Statistics for Daily Returns ---")
print(f"Skewness: {skew(apple_data)}")
print(f"Kurtosis (Fisher): {pd.Series(apple_data).kurtosis()}") # Fisher's definition (normal == 0)
[Figure: 30-day rolling volatility of the AAPL closing price]
--- ADF Test on Daily Returns ---
ADF Statistic: -11.352901141106825
p-value: 9.894488688904509e-21
Critical Values:
	1%: -3.4356006420838963
	5%: -2.8638586845641063
	10%: -2.5680044958343604

--- Distribution Statistics for Daily Returns ---
Skewness: 0.1055018995267093
Kurtosis (Fisher): 5.277830927864649

Damage:¶

Spurious Regression: Modeling non-stationary data can produce spurious, meaningless relationships if cointegration is ignored (cf. Engle & Granger on cointegration).

Incorrect Risk Assessment: Ignoring changing volatility results in inaccurate risk (VaR) calculations.

Unreliable Coefficients: High correlation between features makes their individual effects unclear and unstable.

Directions:¶

Confirm Stationarity: Test if returns/features have stable properties over time using unit root tests (cf. Dickey-Fuller, Phillips-Perron).

Address Non-stationarity/Equilibrium: Check for long-run relationships (cointegration) if data is non-stationary (cf. Granger, Engle); use VECM if found (cf. Engle & Granger, Johansen).

Analyze Multicollinearity: Check/visualize feature correlations; reduce redundancy (PCA/selection) if high.

Model Volatility: Use GARCH models if volatility changes are clustered together.

Detect/Model Regime Changes: Identify and model different market states using Markov Switching.

Refine Feature Engineering: Improve or add features and test different calculation windows, applying principles of feature extraction (cf. Friedman et al. 253).

Deployment:¶

Algorithmic Trading: Create automated trading signals and strategies based on features and regimes.

Risk Management: Improve VaR/ES calculation with volatility models; manage pairs trading risk via cointegration understanding (related to Engle & Granger).

Portfolio Management: Adjust asset allocation based on market regimes; build robust portfolios considering feature correlations.

Financial Forecasting: Predict future returns or volatility using the developed time series models (like VECM cf. Engle & Granger, Johansen).

Feature Engineering Pipeline: Automate the feature creation process for use in larger machine learning systems.
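As a minimal sketch of the risk-management deployment above, historical VaR and Expected Shortfall can be read directly off the empirical return distribution. The synthetic fat-tailed returns below stand in for the AAPL returns computed earlier.

```python
import numpy as np

# Synthetic fat-tailed daily returns (Student-t) stand in for actual data
rng = np.random.default_rng(3)
returns = rng.standard_t(df=5, size=1000) * 0.01

# Historical 95% VaR: the loss threshold exceeded on 5% of days
q05 = np.quantile(returns, 0.05)
var_95 = -q05
# Expected Shortfall: the average loss on the days beyond the VaR threshold
es_95 = -returns[returns <= q05].mean()

print(f"95% VaR: {var_95:.4f}, 95% ES: {es_95:.4f}")
```

ES always exceeds VaR at the same confidence level, since it averages over the tail beyond the VaR threshold; with clustered volatility, a GARCH-based conditional VaR would adapt these numbers through time.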

Conclusion:¶

Actions Taken: Fetched AAPL data, calculated returns, extracted rolling features based on feature extraction concepts (cf. Friedman et al. 253).

Initial Findings: Diagnostics indicate potential issues with stationarity (related to Dickey-Fuller etc.), distribution, and volatility.

Key Challenges: Identified common financial data problems: non-stationarity (cf. Dickey-Fuller), heteroskedasticity, multicollinearity, regimes, requiring methods addressing cointegration if applicable (cf. Granger, Engle).

Importance of Addressing Issues: Ignoring problems leads to unreliable models.

Planned Future Work: Follow "Directions" for rigorous testing (cf. Dickey-Fuller) and advanced modeling like VECM (cf. Engle & Granger, Johansen). Overall Goal: Achieve robust analysis and models for reliable forecasting and practical use.

References¶

Dickey, David A., and Wayne A. Fuller. "Distribution of the estimators for autoregressive time series with a unit root." Journal of the American Statistical Association, vol. 74, no. 366, 1979, pp. 427-431.

Engle, Robert F., and C. W. J. Granger. "Co-Integration and Error Correction: Representation, Estimation, and Testing." Econometrica, vol. 55, no. 2, 1987, pp. 251–276.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Springer, 2001.

Greene, William H. Econometric Analysis. Pearson Education, 2003.

2-1. Non-Stationarity Detection: ADF Test¶

Definitions:¶

Non-Stationarity & Equilibrium¶

Non-stationarity & Unit Root Tests: The work by David Dickey and Wayne Fuller (often cited as Dickey and Fuller, 1979, 1981) on testing for unit roots (a formal way to identify non-stationarity) was groundbreaking. Phillips and Perron, among others, also made significant contributions.

Description:¶

Before applying many time series models, we need to check whether the data is stationary. We will use the Augmented Dickey-Fuller (ADF) test on the Apple daily returns (apple_data) calculated earlier. This test helps determine whether differencing or other transformations are necessary.

Demonstration & Model Implementation¶

Diagnostic:¶

Task: Check if the data is non-stationary¶

Apply the Augmented Dickey-Fuller (ADF) Test to the apple_data series (daily returns).

Damage:¶

Spurious Regressions: Non-stationary data can lead to false correlations in regression models, making unrelated variables appear significantly related.

Unreliable Forecasts: Models built on non-stationary data often produce inaccurate and misleading future predictions because the data's statistical properties change over time.

Violation of Model Assumptions: Many standard time series models assume stationarity. Applying them to non-stationary data can result in incorrect parameter estimates and invalid conclusions.

Need for Transformations: Non-stationarity often necessitates data transformations (like differencing) to achieve stationarity before modeling, adding an extra step to the analysis.

Increased Model Complexity: Dealing with non-stationary data might require using more complex models (like ARIMA) that can handle the non-stationarity.

Deployment: ADF Test for non-stationarity¶


In [ ]:
# ADF Test for non-stationarity
adf_test = adfuller(apple_data)
print(f"ADF Statistic: {adf_test[0]}")
print(f"p-value: {adf_test[1]}")
if adf_test[1] > 0.05:
    print("The data is non-stationary.")
else:
    print("The data is stationary.")
ADF Statistic: -11.352901141106825
p-value: 9.894488688904509e-21
The data is stationary.

Directions: (Interpreting the ADF Test Result)¶

If the p-value reported above is greater than 0.05, we conclude the data is likely non-stationary. In this case, differencing is often applied (see Section 2-2). If the p-value is less than or equal to 0.05, we conclude the data is likely stationary, and differencing may not be required for modeling purposes like ARMA.


2-2. Differencing to Make Data Stationary¶

Definitions:¶

Cointegration & Equilibrium: Sir Clive Granger (Nobel Prize laureate) formalized the concept of cointegration in the 1980s, showing how non-stationary series could have a meaningful long-run equilibrium relationship. Robert Engle (also a Nobel Prize laureate) worked closely with Granger and developed the framework for testing and modeling cointegration, including the Engle-Granger two-step method.

Error Correction Models (ECMs): Engle and Granger demonstrated the link between cointegration and ECMs, showing that if variables are cointegrated, their relationship can be represented by an ECM which captures both the short-run dynamics and the long-run equilibrium adjustment. Soren Johansen later developed more advanced methods for testing cointegration in systems with multiple variables.

Description: Testing & Modeling Non-Stationarity¶

Augmented Dickey-Fuller (ADF) Test checks stationarity (Dickey and Fuller 1083).

Johansen Cointegration Test examines long-term relationships (Johansen 231).

Vector Error Correction Model (VECM) is used if cointegration exists.

Apply first differencing to the apple_data series (if deemed non-stationary). Apply the ADF test to the differenced series to check if stationarity was achieved.

In [ ]:
# NOTE: this cell replaces the AAPL stock_data with a simulated random-walk price series
np.random.seed(42)
dates = pd.date_range(start='2023-01-01', periods=200)
prices = np.cumsum(np.random.normal(0, 1, size=200)) + 100  # Simulated stock prices

stock_data = pd.DataFrame({'Date': dates, 'Close': prices})
stock_data.set_index('Date', inplace=True)

# Rolling features
stock_data["Rolling_Mean"] = stock_data["Close"].rolling(window=30).mean()
stock_data["Volatility"] = stock_data["Close"].rolling(window=30).std()

Diagram: Rolling Statistics¶

In [ ]:
# Plot rolling statistics
plt.figure(figsize=(12,6))
plt.plot(stock_data["Close"], label="Closing Price", color="blue")
plt.plot(stock_data["Rolling_Mean"], label="30-day Rolling Mean", color="red")
plt.legend()
plt.title("Feature Extraction: Rolling Mean")
plt.show()
[Figure: simulated closing price with 30-day rolling mean]

Diagnostic: Johansen Cointegration Test¶

In [ ]:
# Johansen Cointegration Test
johansen_data = stock_data[["Close", "Rolling_Mean"]].dropna()

if len(johansen_data) > 50:
    from statsmodels.tsa.vector_ar.vecm import coint_johansen
    johansen_test = coint_johansen(johansen_data, det_order=0, k_ar_diff=1)
    print("Johansen Test Statistic (trace):", johansen_test.lr1)
    print("Critical Values (90%, 95%, 99%):", johansen_test.cvt)
else:
    print("❗Still not enough data after dropna()")
Johansen Test Statistic (trace): [33.00473089  7.34255887]
Critical Values (90%, 95%, 99%): [[13.4294 15.4943 19.9349]
 [ 2.7055  3.8415  6.6349]]

Directions:¶

  • If data is non-stationary, apply first differencing.
  • Re-run the ADF test after differencing.

This indicates the differenced series is likely stationary and suitable for models assuming stationarity. If the p-value remains high, it might suggest the need for further differencing (less common for financial returns) or exploring other transformations or models. If the original series was already stationary, differencing it might induce patterns (like negative autocorrelation) known as over-differencing.
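The over-differencing warning above can be checked numerically: differencing an already-stationary white-noise series induces an MA(1) structure whose lag-1 autocorrelation is -0.5 in theory. The simulated series below is an illustrative assumption.

```python
import numpy as np

# Differencing white noise (already stationary) induces negative autocorrelation
rng = np.random.default_rng(5)
white = rng.normal(0, 1, 5000)
diffed = np.diff(white)

lag1 = np.corrcoef(diffed[:-1], diffed[1:])[0, 1]
print(f"lag-1 autocorrelation after over-differencing: {lag1:.3f}")  # theory: -0.5
```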

Damage:¶

Spurious Regression: Leads to finding fake, meaningless relationships between time series.

Invalid Inference: Standard statistical tests (t-tests, p-values) become unreliable.

Inconsistent Estimates: Estimated model parameters can be unstable and lack meaning.

Unreliable Forecasts: Models assuming stationarity produce inaccurate predictions.

Incorrect Model Choice: Using models requiring stationarity (like ARMA) directly is fundamentally flawed.

Deployment: Apply first differencing¶

In [ ]:
# Applying differencing to make data stationary
apple_data_diff = apple_data.diff().dropna()
adf_test_diff = adfuller(apple_data_diff)
print(f"ADF Statistic after differencing: {adf_test_diff[0]}")
print(f"p-value after differencing: {adf_test_diff[1]}")
ADF Statistic after differencing: -15.377526955825033
p-value after differencing: 3.442397839516386e-28

References¶

Dickey, David A., and Wayne A. Fuller. "Distribution of the estimators for autoregressive time series with a unit root." Journal of the American Statistical Association, vol. 74, no. 366, 1979, pp. 427-431.

Engle, Robert F., and C. W. J. Granger. "Co-Integration and Error Correction: Representation, Estimation, and Testing." Econometrica, vol. 55, no. 2, 1987, pp. 251–276.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Springer, 2001.

Greene, William H. Econometric Analysis. Pearson Education, 2003.

Johansen, Soren. "Statistical analysis of cointegration vectors." Journal of Economic Dynamics and Control, vol. 12, no. 2-3, 1988, pp. 231-254.

3. Multicollinearity Check using Correlation Matrix¶

Definitions:¶

Feature Extraction¶

Feature extraction is the process of transforming raw data into informative and compact representations while retaining essential characteristics (Friedman et al. 253). In financial time series, common extracted features include moving averages, volatility, and momentum indicators.

Multicollinearity¶

Multicollinearity occurs when independent variables in a regression model are highly correlated, leading to unreliable coefficient estimates (Greene 157). In financial modeling, excessive correlation among predictors can distort inference and reduce predictive power.

Description:¶

The dataset consists of daily closing prices of AAPL stock from January 1, 2020, to January 1, 2025, retrieved via yfinance. This dataset is ideal because:

  • Stock returns often exhibit non-stationarity, necessitating statistical tests.
  • Feature extraction is essential for financial indicators (e.g., rolling volatility).
  • Market variables like trading volume and moving averages introduce multicollinearity.

In [ ]:
data = yf.download("AAPL", start="2020-01-01", end="2025-01-01")
[*********************100%***********************]  1 of 1 completed

Diagnostic:¶

Task: Detect multicollinearity¶

  • Compute correlation matrix among Close price, Rolling Mean, and Volatility.
  • Visualize using a heatmap.
In [ ]:
window = 20  # adjustable window size

# 1. Rolling Mean (20-day moving average)
data['Rolling_Mean'] = data['Close'].rolling(window=window).mean()

# 2. Volatility (20-day rolling standard deviation of returns)
data['Returns'] = data['Close'].pct_change()  # daily returns
data['Volatility'] = data['Returns'].rolling(window=window).std()

# Drop NaN values from rolling calculations
data_clean = data[['Close', 'Rolling_Mean', 'Volatility']].dropna()

Diagram: Correlation: Close Price vs. Rolling Mean vs. Volatility¶

In [ ]:
# Compute correlation matrix
corr_matrix = data_clean.corr()

# Visualize heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
    corr_matrix,
    annot=True,
    cmap='coolwarm',
    fmt=".2f",
    linewidths=0.5,
    annot_kws={"size": 12}
)
plt.title("Correlation: Close Price vs. Rolling Mean vs. Volatility", pad=20)
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.show()
[Figure: correlation heatmap of Close, Rolling Mean, and Volatility]

Directions: Handling Multicollinearity¶

  1. Variance Inflation Factor (VIF) detects collinearity.
In [ ]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Flatten the (Price, Ticker) MultiIndex columns produced by yfinance
data_clean.columns = [f"{a}-{b}" if b else a for a, b in data_clean.columns]
X = data_clean[['Close-AAPL', 'Rolling_Mean', 'Volatility']]

# Calculate VIF for each feature
vif = pd.DataFrame()
vif["Variable"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
# 'Close-AAPL' and 'Rolling_Mean' show severe collinearity (VIF >> 10)
       Variable         VIF
0    Close-AAPL  635.027573
1  Rolling_Mean  642.455302
2    Volatility    2.870124

Principal Component Analysis (PCA) reduces dimensionality.¶

In [ ]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Number of components:", pca.n_components_)
print("Explained variance ratio:", pca.explained_variance_ratio_)
# The two components together explain ~99.6% of the variance in the original features
Number of components: 2
Explained variance ratio: [0.80805975 0.18805552]

Ridge Regression stabilizes estimates when collinearity is high.¶

In [ ]:
from sklearn.linear_model import Ridge,LinearRegression
from sklearn.model_selection import train_test_split

# Note: the target Close-AAPL is itself a column of X, so OLS reproduces it exactly;
# Ridge shrinks that coefficient and spreads weight onto the collinear Rolling_Mean
ridge = Ridge(alpha=1.0).fit(X_scaled, data_clean['Close-AAPL'])
ols = LinearRegression().fit(X_scaled, data_clean['Close-AAPL'])
print("OLS Coefficients:", ols.coef_)
print("Ridge Coefficients:", ridge.coef_)
OLS Coefficients: [ 4.12413561e+01  1.30341216e-14 -2.39379470e-16]
Ridge Coefficients: [39.87665979  1.32399118 -0.04275847]

Damage:¶

Unreliable Coefficients: Coefficients become untrustworthy and unstable.

Difficult Interpretation: Cannot isolate the unique impact of correlated predictors.

Inflated Standard Errors: Features might appear wrongly insignificant.

Misleading Explanations: Model may generalize poorly despite good in-sample fit.

Flawed Strategies: Trading or risk signals based on unstable coefficients are unreliable.

Deployment:¶

Both PCA and Ridge Regression are deployable strategies to mitigate the negative impacts of multicollinearity identified during diagnostics.

PCA achieves this by creating new, uncorrelated features, while Ridge modifies the estimation process to produce more stable coefficients for the original features.

The choice depends on whether dimensionality reduction or maintaining original feature interpretability (albeit with shrunken coefficients) is preferred for the specific modeling task.

References:¶

Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. Wiley.

Greene, W. H. (2018). Econometric analysis (8th ed.). Pearson.

Gujarati, D. N., & Porter, D. C. (2009). Basic econometrics (5th ed.). McGraw-Hill Irwin.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Springer.

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning: With applications in R (2nd ed.). Springer.

Jolliffe, I. T. (2002). Principal component analysis (2nd ed.). Springer.
