Introduction¶

In the intricate domain of derivatives trading and risk management, accurate volatility modeling stands as a cornerstone for pricing instruments, assessing risk, and ensuring regulatory compliance. The reliability of these models hinges on a nuanced understanding of the statistical properties governing financial data: skewness, kurtosis, heteroscedasticity, outlier sensitivity, and temporal misalignments in time series. This work is structured to systematically address these critical elements, bridging theoretical foundations with pragmatic solutions to enhance model robustness and decision-making efficacy.

First, we explore kurtosis (Section 1) and heteroscedasticity (Section 2). Next, we analyze skewness (Section 3), followed by a discussion of sensitivity to outliers (Section 4). Lastly, we confront the complexities of integrating time series with mismatched frequencies, such as combining daily stock prices and quarterly earnings (Section 5). In each section, we provide definitions, detailed descriptions, example demonstrations, visualizations (plots/diagrams), diagnostic methods, and an analysis of the risks posed by each issue.

Kurtosis/Heteroscedasticity¶

1.1 Kurtosis¶

Before defining kurtosis, it is essential to understand the concept of moments in statistics.

Moments: In statistics, moments are quantitative measures (numerical summaries) that describe key characteristics of a probability distribution or dataset: central tendency (mean), dispersion (variance), asymmetry (skewness), and tail behavior (kurtosis).

Let $\{X_{i}\}_{i=1}^{N}$ be a set of random variables. Moments can be classified as either population moments ($\mu_r$) or sample moments ($m_r$). Both can be further classified into three types: raw moments, centered moments, and standardized moments (Prokop and Reeves 439–443).

a) Raw moments can be defined as $\mu_{r} = E(X^r), \text{and } m_r= \frac{1}{n} \sum_{i=1}^{n} (x_{i}^r)$, where $r$ is the order of the moment. We observe that for $r = 1$: $\mu_{1} = E(X) = \mu$ and $m_1= \frac{1}{n} \sum_{i=1}^{n} x_{i} = \bar{x}$, that is, the first population (sample) raw moment is the population (sample) mean.

b) Centered moments measure the deviation of values from the mean and are defined as: $\mu_{r} = E[(X - \mu)^r], \text{ and } m_r= \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x} )^r$, where $\mu$ is the population mean and $\bar{x}$ the sample mean.

If $r = 2$, then: $ \begin{aligned} \mu_{2} &= E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] = E(X^2) - 2\mu E(X) + \mu^2 \\ &= E(X^2) - 2E^2(X) + E^2(X) = E(X^2) - E^2(X) = \operatorname{Var}(X), \\ m_2 &= \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^2 = \operatorname{Var}(x) \quad \text{(sample variance without bias correction)}. \end{aligned} $

c) Standardized moments are dimensionless quantities obtained by dividing the centered populational (sample) moments by the populational (sample - $s$) standard deviation raised to the corresponding power $r$: $\tilde{\mu}_{r} = \frac{\mu_{r}}{\sigma^r} \text{ and }\tilde{m}=\frac{m_r}{s^r}.$
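The three classes of moments can be illustrated with a short numerical sketch (not part of the original text; the synthetic sample and its parameters are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)  # arbitrary synthetic sample

def raw_moment(x, r):
    """r-th sample raw moment: m_r = (1/n) * sum(x_i^r)."""
    return np.mean(x ** r)

def central_moment(x, r):
    """r-th sample central moment: m_r = (1/n) * sum((x_i - x_bar)^r)."""
    return np.mean((x - x.mean()) ** r)

def standardized_moment(x, r):
    """r-th standardized moment: central moment divided by s^r."""
    s = x.std()  # no bias correction, consistent with m_2 above
    return central_moment(x, r) / s ** r

print(raw_moment(x, 1))           # first raw moment = sample mean (~2)
print(central_moment(x, 2))       # second central moment = sample variance (~9)
print(standardized_moment(x, 3))  # third standardized moment = skewness (~0)
print(standardized_moment(x, 4))  # fourth standardized moment = kurtosis (~3)
```

The last line anticipates Section 1.1.1: the fourth standardized moment of a normal sample is close to 3.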

1.1.1 Definition¶

Kurtosis quantifies the tailedness of a distribution, not peakedness, a common misconception. As Anderson and Kenton (2024) note, "A distribution can be infinitely peaked with low kurtosis, and a distribution can be perfectly flat-topped with infinite kurtosis". While early interpretations (Lee 425, Finucan 112) associated kurtosis with peak sharpness, modern research emphasizes that it reflects tail heaviness and the likelihood of outliers (Bai and Ng 49-50).

Kurtosis is defined as the 4th standardized population moment (Balanda and MacGillivray 7, Ruppert 1-2):
$K(X) = \tilde{\mu}_{4} = \frac{\mu_{4}}{\sigma^4}=\frac{E[(X - \mu)^4]}{\sigma^4} = E \left[ \frac{(X - \mu)}{\sigma} \right]^4.$ For a sample, kurtosis is estimated using the 4th standardized sample moment: $ \hat{K}(X) = \tilde{m}_{4} = \frac{m_{4}}{s^4}=\frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_{i} - \bar{x}}{s}\right)^4, \: s^2=\frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x} )^2. $

1.1.2 Excess Kurtosis¶

Kallner (245) defines excess kurtosis as $\hat{K}(X) - 3$, where: a) $\hat{K}(X) = 3$: the distribution is mesokurtic (e.g., the normal distribution) (Figure 1.1); b) $\hat{K}(X) > 3$: the distribution is leptokurtic (lepto meaning "thin") (Figure 1.1); and c) $\hat{K}(X) < 3$: the distribution is platykurtic (platy meaning "broad").
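These three cases can be checked quickly with scipy.stats.kurtosis, which returns excess kurtosis when fisher=True (a sketch on simulated data, not taken from the source):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_s = rng.normal(size=50_000)    # mesokurtic: excess kurtosis ~ 0
laplace_s = rng.laplace(size=50_000)  # leptokurtic: excess kurtosis ~ +3
uniform_s = rng.uniform(size=50_000)  # platykurtic: excess kurtosis ~ -1.2

for name, sample in [("normal", normal_s), ("laplace", laplace_s), ("uniform", uniform_s)]:
    excess = stats.kurtosis(sample, fisher=True)  # K(X) - 3
    print(f"{name:8s} excess kurtosis = {excess:+.2f}")
```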

1.1.3 Multivariate Extension of Kurtosis¶

For multivariate distributions, kurtosis can be extended using methods outlined by Mardia (115–128).
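Mardia's kurtosis measure $b_{2,d}$ can be sketched as follows (an illustrative implementation, not from the source): it is the average squared Mahalanobis distance of the observations from the sample mean, and it tends to $d(d+2)$ for a $d$-variate normal distribution.

```python
import numpy as np

def mardia_kurtosis(X):
    """Mardia's multivariate kurtosis b_{2,d}: the mean of squared
    Mahalanobis distances of the rows of X from the sample mean."""
    centered = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False, bias=True))
    d2 = np.einsum('ij,jk,ik->i', centered, S_inv, centered)  # Mahalanobis^2
    return np.mean(d2 ** 2)

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.3], [0.3, 2]], size=20_000)
print(mardia_kurtosis(X))  # ~ d(d+2) = 8 for a bivariate normal
```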

In [475]:
!pip install numpy scipy matplotlib statsmodels pandas --quiet
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
import sys
import os
import warnings
warnings.filterwarnings('ignore')

class HiddenPrints:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

def ignore_exceptions(func):
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            pass
    return wrapper


#--------------------------------
np.random.seed(42)
x = np.linspace(-4, 4, 1000)

# Function to compute sample kurtosis
def compute_kurtosis(data):
    n = len(data)
    mean = np.mean(data)
    std_dev = np.std(data, ddof=1)
    return np.sum(((data - mean) / std_dev) ** 4) / n


normal_data = np.random.normal(0, 1, 10000)
normal_pdf = stats.norm.pdf(x, 0, 1)
normal_kurtosis = compute_kurtosis(normal_data)


leptokurtic_data = np.random.laplace(0, 0.5, 10000)
leptokurtic_pdf = stats.laplace.pdf(x, 0, 0.5)
leptokurtic_kurtosis = compute_kurtosis(leptokurtic_data)


# Platykurtic example: a uniform distribution (kurtosis = 1.8 < 3).
# Note: a rescaled normal such as N(0, 2) would still be mesokurtic (K = 3).
platykurtic_data = np.random.uniform(-3.5, 3.5, 10000)
platykurtic_pdf = stats.uniform.pdf(x, -3.5, 7)
platykurtic_kurtosis = compute_kurtosis(platykurtic_data)


plt.figure(figsize=(8, 4))
plt.plot(x, normal_pdf, label=f'Mesokurtic (K=3; sample K={normal_kurtosis:.1f})', color='blue', linewidth=2)
plt.plot(x, leptokurtic_pdf, label=f'Leptokurtic (K>3; sample K={leptokurtic_kurtosis:.1f})', color='orange', linewidth=2)
plt.plot(x, platykurtic_pdf, label=f'Platykurtic (K<3; sample K={platykurtic_kurtosis:.1f})', color='red', linewidth=2)


plt.title("Figure 1.1- Types of Kurtosis")
plt.xlabel("Random data")
plt.ylabel("Density")
plt.legend()
plt.show()

1.2 Diagnosis of Kurtosis¶

Kurtosis can be diagnosed through three main approaches:

a) Kurtosis calculation: estimate the kurtosis $K(X)$ and interpret the excess kurtosis $K(X)-3$.

b) Visual inspection, using box plots, histograms, and Q-Q plots (Figure 1.1). Box plots: leptokurtic distributions show more outliers, while platykurtic ones show fewer. Histograms: overlay a density curve to compare with the target distribution (Figure 1.1). Q-Q plots: deviations from the line at the extremes suggest heavy or light tails. Comparison with a uniform distribution: (1) overlay the density function of $U(M-\sigma\sqrt{3},\, M+\sigma\sqrt{3})$, which matches the target's mean and standard deviation; (2) identify excess peak and tails (for unimodal distributions) or missing areas (for U-shaped distributions); (3) visualize differences in the CDF by measuring deviations from the uniform CDF (Figure 1.2) (Kotz and Seier 348-350).

c) Hypothesis testing, using D'Agostino's $K^{2}$ test, the Jarque-Bera test, or the Anscombe-Glynn test (Bai and Ng 49, 51).

Jarque-Bera (JB) test: the test statistic is $\text{JB} = N \left( \frac{\hat{g}_{1}^2}{6} + \frac{[\hat{K}(X) - 3]^2}{24} \right) \stackrel{d}{\longrightarrow} \chi^2_2$, where $g_{1} = \frac{\mu_{3}}{\sigma^3}$ and $\hat{g}_{1} = \frac{\hat{\mu}_{3}}{S^3}$ denote the population and sample skewness, respectively; $\mu_{3}$ is the third central moment and $S$ the sample standard deviation (Kim and White 3, Bastianin 634). Null hypothesis: the data follow a mesokurtic (normal) distribution ($g_{1} = 0$, $K(X) = 3$). Reject $H_0$ if $JB > 5.99$ at the $5\%$ significance level, or use the p-value.

D'Agostino's $K^2$ test combines skewness ($g_1$) and kurtosis ($K(X)$) into a chi-squared statistic: $K^2 = Z_1^2 + Z_2^2 \stackrel{d}{\longrightarrow} \chi^2_2$, where $Z_1 = \frac{\hat{g}_1}{\sqrt{6/N}}$ and $Z_2 = \frac{\hat{K}(X) - 3}{\sqrt{24/N}}$. The null hypothesis is the same as in the Jarque-Bera test.

Anscombe-Glynn test (Anscombe and Glynn 228): the test statistic is $AG = \frac{\hat{K}(X) - 3}{\sqrt{24/N}} \stackrel{d}{\longrightarrow} N(0,1)$. Reject $H_0$ if $|AG| > 1.96$.

1.3 Demonstration¶

In [476]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import beta, uniform, norm

sns.set_theme(style="whitegrid", font_scale=1.1)
plt.rcParams.update({'font.size': 12, 'figure.dpi': 100})

fig, axs = plt.subplots(2, 2, figsize=(16, 4))
plt.subplots_adjust(hspace=0.3, wspace=0.3, top=0.85)

# ================================================
# Beta(0.5,0.5) PDF vs uniform
# ================================================
a, b = 0.5, 0.5
x = np.linspace(0, 1, 1000)
unif_a = 0.5 - np.sqrt(3/8)
unif_b = 0.5 + np.sqrt(3/8)

axs[0, 0].plot(x, beta.pdf(x, a, b), label='Beta(0.5,0.5)', color='navy')
axs[0, 0].plot(x, uniform.pdf(x, unif_a, unif_b - unif_a), 'r--', label=f'Uniform({unif_a:.2f},{unif_b:.2f})')
axs[0, 0].fill_between(x, beta.pdf(x, a, b), uniform.pdf(x, unif_a, unif_b - unif_a),
                        where=(beta.pdf(x, a, b) < uniform.pdf(x, unif_a, unif_b - unif_a)),
                        color='gray', alpha=0.3, label='Missing area')
axs[0, 0].set_xlabel('x')
axs[0, 0].set_ylabel('Density')
axs[0, 0].set_title('a)')
axs[0, 0].legend()

# ================================================
# Beta(0.5,0.5) CDF vs uniform
# ================================================
axs[0, 1].plot(x, beta.cdf(x, a, b), label='Beta CDF', color='navy')
axs[0, 1].plot(x, uniform.cdf(x, unif_a, unif_b - unif_a), 'r--', label='Uniform CDF')
axs[0, 1].fill_between(x, beta.cdf(x, a, b), uniform.cdf(x, unif_a, unif_b - unif_a),
                        where=(beta.cdf(x, a, b) < uniform.cdf(x, unif_a, unif_b - unif_a)),
                        color='gray', alpha=0.3, label='Missing area')
axs[0, 1].set_xlabel('x')
axs[0, 1].set_ylabel('Cumulative probability')
axs[0, 1].set_title('b)')
axs[0, 1].legend()

# ================================================
# Normal PDF Peak/Tails
# ================================================
mu, sigma = 0, 1
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 1000)
unif_a = mu - sigma * np.sqrt(3)
unif_b = mu + sigma * np.sqrt(3)

axs[1, 0].plot(x, norm.pdf(x, mu, sigma), label='Normal PDF', color='darkgreen')
axs[1, 0].plot(x, uniform.pdf(x, unif_a, unif_b - unif_a), 'r--', label=f'Uniform({unif_a:.2f},{unif_b:.2f})')
axs[1, 0].fill_between(x, norm.pdf(x, mu, sigma), uniform.pdf(x, unif_a, unif_b - unif_a),
                        where=(norm.pdf(x, mu, sigma) > uniform.pdf(x, unif_a, unif_b - unif_a)),
                        color='skyblue', alpha=0.3, label='Excess Area')
axs[1, 0].set_ylabel('Density')
axs[1, 0].set_title('c)')
axs[1, 0].legend()

# ================================================
# Normal CDF comparison
# ================================================
cdf = norm.cdf(x, mu, sigma)
uniform_cdf = np.clip((x - unif_a) / (unif_b - unif_a), 0, 1)

axs[1, 1].plot(x, cdf, 'b-', label='$F(x)$')
axs[1, 1].plot(x, uniform_cdf, 'r--', label='Uniform CDF')

axs[1, 1].annotate('Position of peak\nto the right of $M$', xy=(mu + sigma * 0.5, 0.65),
                    xytext=(mu + 2, 0.3),
                    arrowprops=dict(arrowstyle="->", color='k'),
                    ha='center')

axs[1, 1].set_xlabel('$x$')
axs[1, 1].set_ylabel('Cumulative Probability')
axs[1, 1].set_title('d)')
axs[1, 1].legend()


fig.suptitle('Figure 1.2- Diagnosis of Kurtosis based on Comparison of CDF with a Uniform Distribution', fontsize=14)

plt.show()

1.4 Damage Caused by excess Kurtosis¶

Excess kurtosis in financial returns, characterized by "fat tails" and a sharper peak than the normal distribution, leads to significant systemic and practical damages:

Underestimated tail risk: Traditional models (e.g., Gaussian) fail to account for the heightened probability of extreme returns, causing institutions to undervalue risks. This results in flawed Value at Risk (VaR) calculations and insufficient capital buffers, exposing firms to catastrophic losses during crises (López Martín et al. 2033); Mispricing of derivatives: Options and other tail-sensitive instruments are often priced using Black-Scholes models, which assume normality. Excess kurtosis leads to mispriced volatility (e.g., undervalued out-of-the-money options), distorting hedging strategies and amplifying systemic risk (Hull 158); Ineffective portfolio optimization: Modern portfolio theory relies on mean-variance frameworks, ignoring tail risks. Portfolios optimized under Gaussian assumptions are overly concentrated in high-kurtosis assets, increasing vulnerability to market shocks (Taleb 72); Regulatory shortcomings: Basel accords and stress-testing frameworks often use simplified distributions, underestimating the likelihood of extreme events. This creates regulatory blind spots, as seen in the 2020 COVID-19 market crash (Danielsson et al.).

1.5 Directions to address excess Kurtosis¶

Bounded mixture distributions: The U-GBC (Uniform Generalized Bicubic) and U-TSP (Uniform Two-Sided Power) models proposed by López Martín et al. explicitly recover empirical kurtosis while operating within finite domains. These avoid the infinite-variance pitfalls of stable Paretian/Lévy models and closely match financial data (2041); Extreme value theory (EVT): Focuses on tail behavior using generalized Pareto distributions (GPD) to model exceedances beyond thresholds. EVT improves VaR and expected shortfall (ES) estimates by directly addressing fat tails (Embrechts et al. 109); Volatility clustering models: GARCH and stochastic volatility models capture time-varying volatility, indirectly reducing kurtosis by accounting for clustered extreme returns (Bollerslev 542); Non-extensive statistical mechanics: q-Gaussian distributions with $1 < q < 3$ flexibly model fat tails, though they face challenges in finite-variance regimes (Tsallis et al. 2040); Robust risk metrics: Replace VaR with ES or entropic risk measures, which better capture tail risk severity (Rockafellar and Uryasev); Student’s t-distribution: While it accommodates heavier tails than the normal distribution, its effectiveness diminishes for datasets requiring very low degrees of freedom (Blattberg and Gonedes 2038).
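The EVT direction can be sketched with a peaks-over-threshold fit (an illustrative example on simulated heavy-tailed losses; the data-generating process and the 95% threshold are assumptions, not from the source):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
losses = -rng.standard_t(df=3, size=50_000)  # simulated fat-tailed losses

# Peaks-over-threshold: model exceedances above a high quantile with a GPD
u = np.quantile(losses, 0.95)                # threshold (95th percentile)
exceedances = losses[losses > u] - u
xi, _, beta = stats.genpareto.fit(exceedances, floc=0)

# Tail-based 99% VaR: solve P(L > x) = p_u * (1 - F_GPD(x - u)) = 1%
p_u = (losses > u).mean()
var_99 = u + stats.genpareto.ppf(1 - 0.01 / p_u, xi, loc=0, scale=beta)
print(f"shape xi = {xi:.2f} (xi > 0 indicates a heavy tail), 99% VaR = {var_99:.2f}")
```

A positive fitted shape parameter $\xi$ signals a Pareto-type tail; for $t$-distributed data, $\xi \approx 1/\text{df}$.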

In [477]:
import numpy as np
import pandas as pd
from scipy import stats


np.random.seed(42)

# Generate normal and fat-tailed (Student's t) data
normal_data = np.random.normal(loc=0, scale=1, size=1000)
t_data = np.random.standard_t(df=3, size=1000)

# Anscombe-Glynn test function
def anscombe_glynn_test(data):
    excess_kurtosis = stats.kurtosis(data, fisher=True)
    n = len(data)
    ag = excess_kurtosis / np.sqrt(24 / n)
    p_value = 2 * (1 - stats.norm.cdf(np.abs(ag)))
    return ag, p_value

# Diagnostic tests
def run_tests(data, label):
    jb = stats.jarque_bera(data)
    k2 = stats.normaltest(data)
    ag, p_ag = anscombe_glynn_test(data)
    return {
        "Dataset": label,
        "Jarque-Bera Stat": jb.statistic,
        "Jarque-Bera p-value": jb.pvalue,
        "D’Agostino K² Stat": k2.statistic,
        "D’Agostino K² p-value": k2.pvalue,
        "Anscombe-Glynn Stat": ag,
        "Anscombe-Glynn p-value": p_ag
    }

results = pd.DataFrame([
    run_tests(normal_data, "Normal"),
    run_tests(t_data, "Student's t (df=3)")
])

# VaR
empirical_var = np.percentile(t_data, 5)
mean_t, std_t = np.mean(t_data), np.std(t_data)
normal_var = stats.norm.ppf(0.05, loc=mean_t, scale=std_t)

# Fit Student's t-distribution and calculate VaR
df, loc, scale = stats.t.fit(t_data)
t_var = stats.t.ppf(0.05, df=df, loc=loc, scale=scale)

var_results = pd.DataFrame({
    "Model": ["Normal VaR (5%)", "Empirical VaR (5%)", "t-Based VaR (5%)"],
    "Value": [normal_var, empirical_var, t_var]
})

print("Diagnostic Tests Results:")
print(results)
print("\nVaR Estimates:")
print(var_results)
Diagnostic Tests Results:
              Dataset  Jarque-Bera Stat  Jarque-Bera p-value  \
0              Normal          2.456373             0.292823   
1  Student's t (df=3)       1565.864886             0.000000   

   D’Agostino K² Stat  D’Agostino K² p-value  Anscombe-Glynn Stat  \
0            2.575518           2.758884e-01             0.427357   
1          146.183757           1.805513e-32            39.443466   

   Anscombe-Glynn p-value  
0                0.669119  
1                0.000000  

VaR Estimates:
                Model     Value
0     Normal VaR (5%) -2.575878
1  Empirical VaR (5%) -2.353898
2    t-Based VaR (5%) -2.304689

The diagnostic tests applied to synthetic normal data (Jarque-Bera: Stat=2.46, p=0.293; D’Agostino’s K²: Stat=2.58, p=0.276; Anscombe-Glynn: AG=0.43, p=0.669) confirm adherence to normality, while the fat-tailed Student’s t data exhibit significant deviations (JB=1565.86, $p\approx 0$; $K^2=146.18$, $p\approx 0$; AG=39.44, $p\approx 0$), rejecting normality due to excess kurtosis. Although the normal model inflates the VaR estimate (-2.58 vs. the empirical -2.35 at the $5\%$ level, a consequence of the overestimated standard deviation drawn from the t-data’s heavy-tailed variance), fitting a Student’s t-distribution (ν=3.42, scale=1.04) aligns the t-based VaR (-2.30) closely with the empirical result, demonstrating its efficacy in correcting kurtosis-induced miscalculations.

2. Heteroscedasticity¶

Heteroscedasticity occurs when the variance of the error term $e_i$ in a regression model is nonconstant across observations. Formally, it is defined as: $E(e_i^2) = \sigma_i^2 \quad \text{(varying with } i\text{)}$. This violates the classical linear regression model assumption of homoscedasticity, where $E(e_i^2) = \sigma^2$ (constant for all $i$) (Gujarati and Porter 1–2).

Heteroscedasticity arises when the dispersion of residuals changes systematically with an explanatory variable or over time, often observed in cross-sectional data due to scale effects (e.g., income vs. savings) (Gujarati and Porter 3).
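The scale effect can be made concrete by simulating a regression whose error standard deviation grows with the regressor (a minimal sketch; the proportionality $\sigma_i = 0.5\,x_i$ is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.linspace(1, 50, n)            # e.g., income
e = rng.normal(0, 0.5 * x)           # error sd grows with x: sigma_i = 0.5 * x_i
y = 10 + 2 * x + e                   # e.g., savings

# Rough check: residual spread for large x vs small x
lower = np.std(e[x < x.mean()])
upper = np.std(e[x >= x.mean()])
print(f"sd(e | low x) = {lower:.2f},  sd(e | high x) = {upper:.2f}")
```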

2.1 Diagnosis¶

Residual Plots: Visual inspection of residuals vs. predictors or fitted values (fan-shaped pattern); Formal Tests: Park Test: Regress $\ln(\text{residuals}^2)$ on $\ln(X)$ (Gujarati and Porter 285–286); Glejser Test: Regress absolute residuals on $X$ (Adegbilero-Iwari et al. 287–288); White Test: Auxiliary regression of squared residuals on predictors, their squares, and cross-products (Gujarati and Porter 289–290); Breusch-Pagan Test: Checks linear dependence of variance on predictors (Adegbilero-Iwari et al. 287).

2.2 Damage¶

OLS estimators remain unbiased but inefficient (higher variance); Biased standard errors lead to incorrect $t$-statistics and $p$-values, increasing Type I/II errors (Gujarati and Porter 280–281); Confidence intervals and hypothesis tests become unreliable.

2.3 Directions¶

Weighted Least Squares (WLS): Transform data using weights $1/\sqrt{X}$ or $1/X$ (Gujarati and Porter 293–294); Logarithmic transformation: Compress scales to stabilize variance (e.g., $\ln(Y)$ vs. $\ln(X)$) (Gujarati and Porter 297); Robust Standard Errors: Use White’s heteroscedasticity-consistent estimators (Gujarati and Porter 298–299); ARCH/GARCH Models: For time-series data with volatility clustering (Hsieh 307–308).

2.4 Demonstration¶

In [478]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

np.random.seed(42)
n = 50

x = np.linspace(1, 50, n)

# NOTE: y is an exact linear function of x, so the OLS residuals below are
# floating-point rounding errors whose magnitude grows with the fitted values,
# producing the (numerically driven) heteroscedastic pattern the tests detect
y = np.linspace(20, 500, n)

data = pd.DataFrame({'x': x, 'y': y})

# ---------------------------
# Fit the OLS Model and Conduct Formal Tests
# ---------------------------
X = sm.add_constant(data['x'])
ols_model = sm.OLS(data['y'], X).fit()

print("Table 1.1: OLS Regression Summary:")
print(ols_model.summary())

# ------------------------------------------
# 3. Correcting Heteroscedasticity using WLS
# ------------------------------------------
# Compute absolute residuals and fitted values
data["abs_residuals"] = np.abs(ols_model.resid)
data["fitted_values"] = ols_model.fittedvalues
Table 1.1: OLS Regression Summary:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.147e+32
Date:                Sun, 23 Mar 2025   Prob (F-statistic):               0.00
Time:                        22:49:05   Log-Likelihood:                 1430.2
No. Observations:                  50   AIC:                            -2856.
Df Residuals:                      48   BIC:                            -2853.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         10.2041   2.68e-14   3.81e+14      0.000      10.204      10.204
x              9.7959   9.15e-16   1.07e+16      0.000       9.796       9.796
==============================================================================
Omnibus:                        4.966   Durbin-Watson:                   0.097
Prob(Omnibus):                  0.083   Jarque-Bera (JB):                2.236
Skew:                          -0.188   Prob(JB):                        0.327
Kurtosis:                       2.034   Cond. No.                         59.5
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [479]:
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan Test
bp_test = het_breuschpagan(ols_model.resid, X)
labels = ['LM Statistic', 'LM-Test p-value', 'F-Statistic', 'F-Test p-value']
print("\n OLS Breusch-Pagan Test Results:")
for label, value in zip(labels, bp_test):
    print(f"{label}: {value:.4f}")
 OLS Breusch-Pagan Test Results:
LM Statistic: 41.6877
LM-Test p-value: 0.0000
F-Statistic: 240.7305
F-Test p-value: 0.0000
In [480]:
# Fit auxiliary OLS model to get weights
model_temp = smf.ols("abs_residuals ~ fitted_values", data=data).fit()

# Compute weights as the inverse squared fitted |residuals|
weights = model_temp.fittedvalues ** -2
data["weights"] = weights

# Fit WLS model with the estimated weights
Y = data["y"]
X = sm.add_constant(data["x"])  # add an intercept
model_WLS = sm.WLS(Y, X, weights=data["weights"]).fit()
In [481]:
# Diagnostic Plots
# ---------------------------
fig, axs = plt.subplots(1, 2, figsize=(11, 5))

# Plot 1: OLS Residuals vs. Fitted Values
axs[0].scatter(ols_model.fittedvalues, ols_model.resid, alpha=0.7, edgecolors='k')
axs[0].axhline(y=0, color='red', linestyle='--')
axs[0].set_title("a) OLS Residuals vs. Fitted Values")
axs[0].set_xlabel("Fitted Values")
axs[0].set_ylabel("Residuals")

# Plot 2: WLS Residuals vs. Fitted Values
axs[1].scatter(model_WLS.fittedvalues, model_WLS.resid, alpha=0.7, edgecolors='k', color='green')
axs[1].axhline(y=0, color='red', linestyle='--')
axs[1].set_title("b) WLS Residuals vs. Fitted Values")
axs[1].set_xlabel("Fitted Values")
axs[1].set_ylabel("Residuals")

plt.tight_layout()
fig.suptitle('Figure 1.3 - Diagnostic Plots: a) OLS Residuals vs Fitted Values, b) WLS Residuals vs Fitted Values', fontsize=14)

plt.show()
In [482]:
import pandas as pd
from statsmodels.stats.api import het_breuschpagan

# Perform Breusch-Pagan Test for OLS
bp_test = het_breuschpagan(ols_model.resid, X)
bp_test_results = {
    'Statistic': bp_test[0],
    'LM-Test p-value': bp_test[1],
    'F-Statistic': bp_test[2],
    'F-Test p-value': bp_test[3]
}

# Perform Breusch-Pagan Test for WLS
WLSbp_test = het_breuschpagan(model_WLS.resid, X)
WLSbp_test_results = {
    'Statistic': WLSbp_test[0],
    'LM-Test p-value': WLSbp_test[1],
    'F-Statistic': WLSbp_test[2],
    'F-Test p-value': WLSbp_test[3]
}

bp_results_df = pd.DataFrame({
    'OLS': bp_test_results,
    'WLS': WLSbp_test_results
})

print("\nBreusch-Pagan Test Results DataFrame:")
print(bp_results_df)
Breusch-Pagan Test Results DataFrame:
                          OLS       WLS
Statistic        4.168775e+01  0.001889
LM-Test p-value  1.070779e-10  0.965334
F-Statistic      2.407305e+02  0.001813
F-Test p-value   2.481934e-20  0.966210

Interpretation¶

The Breusch-Pagan test results show significant heteroscedasticity in the OLS model, with a p-value of $1.0708 \times 10^{-10} < 0.05$, indicating that the residuals have non-constant variance. In contrast, the WLS model's results suggest no heteroscedasticity, as evidenced by a p-value of $0.9653 > 0.05$. These findings imply that WLS successfully addresses heteroscedasticity, making it more suitable for the data; OLS, by contrast, may not be appropriate for data with heteroscedastic residuals (see also Figure 1.3).

2.5 Conclusion¶

Kurtosis quantifies tail heaviness, not peakedness, with excess kurtosis in financial data leading to underestimated tail risks, mispriced derivatives, and flawed portfolio optimization. Diagnosis combines standardized moments, visual tools (Q-Q plots), and tests (Jarque-Bera). Remedies include bounded mixtures, EVT, and GARCH models to address fat-tailed distributions. Heteroscedasticity violates regression assumptions, causing inefficient OLS estimates and biased inference, detected via Breusch-Pagan tests or residual plots. Corrective measures like WLS, robust errors, and ARCH/GARCH restore model validity by stabilizing variance.

References¶

Adegbilero-Iwari, Oluwaseun, et al. "Comparison of Different Tests for Detecting Heteroscedasticity in Datasets." Annals. Computer Science Series, vol. 18, no. 2, 2020, pp. 77–85.

Anderson, S., and W. Kenton. "Kurtosis." Investopedia, 2024, https://www.investopedia.com/terms/k/kurtosis.asp.

Anscombe, F. J., and W. J. Glynn. "Distribution of the Kurtosis Statistic b2 for Normal Samples." Biometrika, vol. 70, no. 1, 1983, pp. 227–234, https://doi.org/10.1093/biomet/70.1.227.

Bai, Jushan, and Serena Ng. "Tests for Skewness, Kurtosis, and Normality for Time Series Data." Journal of Business & Economic Statistics 23.1 (2005): 49-60.

Balanda, Kevin P., and Helen L. MacGillivray. "Kurtosis and Spread." Canadian Journal of Statistics 18.1 (1990): 17-30.

Balanda, Kevin P., and H. L. MacGillivray. "Kurtosis: A Critical Review." The American Statistician 42.2 (1988): 111-119.

Bastianin, Andrea. "Robust Measures of Skewness and Kurtosis for Macroeconomic and Financial Time Series." Applied Economics 52.7 (2020): 637-670.

Blattberg, Robert, and Nicholas Gonedes. “A Comparison of the Stable and Student Distributions as Statistical Models for Stock Prices.” Journal of Business, vol. 47, no. 2, 1974, pp. 244–280.

Bollerslev, Tim. “A Conditionally Heteroskedastic Time Series Model for Speculative Prices and Rates of Return.” Review of Economics and Statistics, vol. 69, no. 3, 1987, pp. 542–547.

Danielsson, Jon, et al. “Model Risk of Risk Models.” Journal of Financial Stability, vol. 23, 2016, pp. 79–91.

Embrechts, Paul, et al. Modelling Extremal Events for Insurance and Finance. Springer, 1997.

Finucan, H. M. "A Note on Kurtosis." Journal of the Royal Statistical Society: Series B (Methodological) 26.1 (1964): 111-112.

Gujarati, Damodar N., and Dawn C. Porter. Essentials of Econometrics. 5th ed., McGraw-Hill, 2009.

Hsieh, David A. "Modeling Heteroscedasticity in Daily Foreign-Exchange Rates." Journal of Business & Economic Statistics, vol. 7, no. 3, 1989, pp. 307–317.

Hull, John C. Options, Futures, and Other Derivatives. 10th ed., Pearson, 2017.

Kallner, Anders. Laboratory Statistics: Methods in Chemistry and Health Sciences. Elsevier, 2017.

Kim, Tae-Hwan, and Halbert White. "On More Robust Estimation of Skewness and Kurtosis." Finance Research Letters 1.1 (2004): 56-73.

Kotz, Samuel, and Edith Seier. "Visualizing Peak and Tails to Introduce Kurtosis." The American Statistician 62.4 (2008): 348-354.

Lee, Jay. "Statistics, Descriptive." (2009): 422-428.

López Martín, María del Mar, et al. “Treatment of Kurtosis in Financial Markets.” Physica A: Statistical Mechanics and Its Applications, vol. 391, no. 9, 2012, pp. 2032–2045.

Mardia, Kanti V. "Applications of Some Measures of Multivariate Skewness and Kurtosis in Testing Normality and Robustness Studies." Sankhyā: The Indian Journal of Statistics, Series B (1974): 115-128.

Prokop, Richard J., and Anthony P. Reeves. "A Survey of Moment-Based Techniques for Unoccluded Object Representation and Recognition." CVGIP: Graphical Models and Image Processing 54.5 (1992): 438-460.

Rockafellar, R. Tyrrell, and Stanislav Uryasev. “Optimization of Conditional Value-at-Risk.” Journal of Risk, vol. 2, no. 3, 2000, pp. 21–41.

Ruppert, David. "What Is Kurtosis? An Influence Function Approach." The American Statistician 41.1 (1987): 1-5.

Taleb, Nassim Nicholas. The Black Swan: The Impact of the Highly Improbable. Random House, 2007.

Tsallis, Constantino, et al. “Nonextensive Statistical Mechanics and Its Applications.” Lecture Notes in Physics, Springer, 2001.

"D'Agostino's K-squared Test." Wikipedia, https://en.wikipedia.org/wiki/D%27Agostino%27s_K-squared_test.

"Jarque–Bera Test." Wikipedia, https://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test.

3. Skewness¶

3.1 Definition:¶

Skewness is a statistical measure that quantifies the asymmetry of a data distribution. A perfectly symmetric distribution (such as the normal distribution) has a skewness of zero, meaning the left and right sides of the distribution mirror each other.

For a univariate dataset $Y_1$,$Y_2$,...,$Y_N$, the Fisher-Pearson coefficient of skewness is defined as:

$$ g_1 = \frac{m_3}{m_2^{3/2}} $$

where

$$ m_i = \frac{1}{N} \sum_{n=1}^{N} (Y_n - \bar{Y})^i $$

is the biased sample central moment of order $i$,

  • $\bar{Y}$ is the sample mean

  • $N$ is the number of data points

Many software programs compute the Adjusted Fisher-Pearson Coefficient of Skewness (e.g., scipy.stats.skew with bias=False; scipy's default, bias=True, returns $g_1$), defined as:

$$G_1 = \frac{\sqrt{N(N-1)}}{N-2} \, g_1$$

where,

  • $g_1$ is the Fisher-Pearson coefficient of skewness defined above
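The relation between $g_1$ and $G_1$ can be verified with scipy's `bias` flag (a small sketch; the sample is an arbitrary assumption):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
data = rng.lognormal(size=30)   # small right-skewed sample
N = len(data)

g1 = skew(data, bias=True)      # Fisher-Pearson g_1 (scipy's default)
G1 = skew(data, bias=False)     # adjusted Fisher-Pearson G_1
adjusted = np.sqrt(N * (N - 1)) / (N - 2) * g1

print(g1, G1, adjusted)  # G1 and the manual adjustment agree
```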

Interpreting Skewness

  • Zero skew: the data are symmetric (a necessary, but not sufficient, condition for normality)
  • Positive skew (right skew): the right tail is longer
  • Negative skew (left skew): the left tail is longer
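The sign convention can be checked on simulated samples (a sketch, not from the source):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
right = rng.lognormal(size=100_000)  # long right tail
left = -right                        # mirror image: long left tail
sym = rng.normal(size=100_000)       # symmetric

print(f"right-skewed sample: skew = {skew(right):+.2f}")  # positive
print(f"left-skewed sample:  skew = {skew(left):+.2f}")   # negative
print(f"symmetric sample:    skew = {skew(sym):+.2f}")    # near zero
```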

3.2 Description:¶

Skewness measures the symmetry of data distribution, especially in financial markets, where asset returns are often not completely symmetric. This means that the distribution of data on both sides of the mean may be biased, which can affect investor decisions, the robustness of financial models, and the predictive power of machine learning models.

For example, in the stock market:

Individual stock returns are often right-skewed (skewness > 0):

  • Small increases are common, but there are occasional large jumps (e.g., Tesla or Apple)

Returns may be left-skewed (skewness < 0) when the market crashes:

  • Returns are stable most of the time, but there are occasional large drops

Skewness not only affects investors' return expectations, but also the robustness of financial models. In option pricing, risk management, and machine learning modeling, data are often assumed to be normally distributed; if the data are strongly skewed, this assumption can lead to poor predictions.

3.3 Demonstration:¶

In [483]:
# !pip install yfinance
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew
import yfinance as yf

stock_data = yf.download("AAPL", start="2024-01-01", end="2025-01-01")
stock_data["Returns"] = stock_data["Close"].pct_change()
data_apple= stock_data["Returns"].dropna()
returns_skewness = skew(data_apple)
print("\n")
print(f"AAPL Skewness(return data from 2024-01-01 to 2025-01-01): {returns_skewness:.2f}")

AAPL Skewness(return data from 2024-01-01 to 2025-01-01): 0.50

Diagram:

In [484]:
plt.figure(figsize=(8, 4))
sns.histplot(stock_data["Returns"].dropna(), bins=50, kde=True, color='purple')
plt.title(f"AAPL Daily Returns (Skewness = {returns_skewness:.2f})")
plt.show()

3.4 Diagnosis¶

Observe the histogram: see whether the distribution is symmetrical (as in the demonstration above).

Calculate values: use scipy.stats.skew() to compute the coefficient:

  • If greater than 0, the data is right-skewed
  • If less than 0, the data is left-skewed
In [485]:
import scipy
from scipy.stats import probplot

print('Calculate values - scipy skew: ',scipy.stats.skew(data_apple))
Calculate values - scipy skew:  0.5013931445602677

Interpreting Skewness in a Box Plot

  1. Observe the relative position of the mean and median
  • If the mean is greater than the median, the data is right-skewed (positively skewed);
  • If the mean is less than the median, the data is left-skewed (negatively skewed);
  • If the mean is almost the same as the median, the data is approximately symmetric.

Sample: As can be seen in the figure below for Apple returns, the mean (green line) and the median (red line) almost overlap, indicating that the data is fairly symmetrical and the skewness is close to 0.

  2. Observe the length of the whiskers
  • If the data is strongly right-skewed (Skewness > 0), the upper whisker will be significantly longer than the lower whisker, and vice versa.

Sample: The upper whisker in the figure below (the thin upward line) is slightly longer than the lower whisker, indicating that the data may have a slight positive skewness (Skewness > 0), though it is not pronounced.

  3. Observe outliers: the data has several outliers extending upward, indicating some extremely large positive return values, which may slightly increase the mean and make the data slightly positively skewed.
In [486]:
# Boxplot
plt.figure(figsize=(6, 6))
sns.boxplot(y=stock_data["Returns"].dropna(), color="lightblue")
plt.title("Boxplot of stock AAPL returns")
plt.ylabel("Return Rate")
plt.axhline(stock_data["Returns"].mean(), color="green", linestyle="--", label="Mean")
plt.axhline(stock_data["Returns"].median(), color="red", linestyle="-.", label="Median")
plt.legend()
plt.show()

Interpreting Skewness in QQ Plot

Right-skewed (Positive Skewness):

  • Points on the right tail (upper right corner) deviate above the line.
  • Points on the left tail (lower left corner) deviate below the line.

Left-skewed (Negative Skewness):

  • Points on the right tail (upper right corner) fall below the line.
  • Points on the left tail (lower left corner) deviate above the line.

Normal Distribution:

  • Points roughly follow the diagonal line.
In [487]:
# QQ Plot
data = stock_data["Returns"].dropna()
fig, ax = plt.subplots(figsize=(6, 6))
probplot(data, dist="norm", plot=ax)
plt.title("QQ Plot of AAPL Returns 2024- 2025")
plt.show()

3.6 Damage:¶

  • Impact on the stability of the mean: the mean is strongly affected by extreme values (outliers), leading to misleading analysis results
  • Misleading risk assessment: when skewness is large, traditional VaR (value at risk) may underestimate extreme risks
  • Impact on machine learning models: some regression models assume symmetric data, and their performance degrades when skewness is too large

3.7 Directions¶

In finance, addressing skewness is crucial as it impacts risk assessment and model accuracy. When encountering skewed data, we can consider the following approaches:

3.7.1. Data Transformation: Apply logarithmic, square root, or Box-Cox transformations to reduce skewness. For example, the Box-Cox transformation is defined as:
$$ y(\lambda) = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\ \ln(x) & \text{if } \lambda = 0. \end{cases}$$ 3.7.2. Robust Statistical Methods: Use non-parametric techniques (e.g., bootstrapping) to estimate parameters without assuming normality (Doane and Seward 6-15).
3.7.3. Alternative Distributions: Model data using skewed distributions like the skew-normal or generalized hyperbolic distribution (Azzalini 160).
3.7.4. Adjust Financial Models: Modify models such as CAPM or Value-at-Risk (VaR) to incorporate skewness, as traditional models often underestimate tail risk (Fama and French 21).
3.7.5. Diagnostic Tools: Combine skewness statistics (e.g., Pearson’s $Sk_2 = 3 \frac{(\bar{x} - m)}{s}$) with visualizations (histograms, Q-Q plots) to assess asymmetry (Doane and Seward 2-8).
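The transformations in 3.7.1 can be sketched in a few lines. The lognormal sample below is an assumed stand-in for a positive, right-skewed financial quantity; scipy.stats.boxcox estimates $\lambda$ by maximum likelihood.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=0.8, size=2000)  # positive, right-skewed sample

x_log = np.log(x)              # log transform (the Box-Cox lambda = 0 case)
x_bc, lam = stats.boxcox(x)    # Box-Cox transform, lambda fitted by MLE

print(f"skew raw:     {stats.skew(x):.3f}")
print(f"skew log:     {stats.skew(x_log):.3f}")
print(f"skew Box-Cox: {stats.skew(x_bc):.3f}  (lambda = {lam:.3f})")
```

Because the sample is lognormal, the fitted $\lambda$ lands near 0 and both transforms bring the skewness close to zero. Note that Box-Cox requires strictly positive inputs, so returns (which can be negative) are usually shifted first or the transform is applied to prices instead.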

Conclusion¶

Skewness is an important statistical measure of the asymmetry of data, with applications in finance, economics, and statistical modeling.

  • The skewness of a normal distribution is 0, but real-world data often does not have zero skewness.
  • Right-skewed data (Skewness > 0) means gains are typically small but occasionally surge, a classic example being stock market returns.
  • Left-skewed data (Skewness < 0) means large declines are rare but have far-reaching consequences, such as the distribution of returns during a market crash.

References

Azzalini, Adelchi. “The Skew-Normal Distribution and Related Multivariate Families.” Scandinavian Journal of Statistics, vol. 32, no. 2, 2005, pp. 159–88.

Doane, David P., and Lori E. Seward. “Measuring Skewness: A Forgotten Statistic?” Journal of Statistics Education, vol. 19, no. 2, 2011, https://doi.org/10.1080/10691898.2011.11889611.

Fama, Eugene F., and Kenneth R. French. “Common Risk Factors in the Returns on Stocks and Bonds.” Journal of Financial Economics, vol. 33, no. 1, 1993, pp. 3–56.

NIST/SEMATECH. “1.3.5.11. Measures of Skewness and Kurtosis.” e-Handbook of Statistical Methods, NIST.

SciPy. “scipy.stats.skew.” SciPy v1.11.3 Manual.

Illowsky, Barbara, and Susan Dean. “2.6 Skewness and the Mean, Median, and Mode.” Statistics, OpenStax, 2020.

Zwillinger, Daniel, and Stephen Kokoska. CRC Standard Probability and Statistics Tables and Formulae. Chapman & Hall, 2000.

4 Addressing sensitivity to outliers¶

4.1 Definition:¶

  • Outliers are data points that significantly deviate from the general pattern of the dataset.

  • These extreme values can disproportionately influence statistical measures and models, leading to misleading conclusions.

  • Outliers can arise due to errors in data collection, recording mistakes, or represent legitimate but rare occurrences in a dataset (Aguinis et al. 271).

4.2 Description:¶

Outliers are broadly classified into three categories:

  • Error Outliers: These result from inaccuracies such as data entry mistakes, measurement errors, or sampling issues. They are non-legitimate observations that should be corrected or removed (Aguinis et al. 282).

  • Interesting Outliers: These are valid but extreme observations that might offer valuable insights. Instead of removing them, researchers should analyze them separately to identify patterns or novel findings (Aguinis et al. 284).

  • Influential Outliers: These points significantly impact model fit and parameter estimates. Their presence can change substantive conclusions, making it necessary to assess their influence carefully (Aguinis et al. 281).

4.3 Demonstration:¶

To illustrate the impact of outliers, we use Apple Inc. (AAPL) stock return data and apply a percentile-based capping method to handle extreme values.

In [488]:
import numpy as np
import pandas as pd
import yfinance as yf
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import skew

# 1. Get Apple stock data
stock_data = yf.download("AAPL", start="2024-01-01", end="2025-01-01")
stock_data["Returns"] = stock_data["Close"].pct_change()
apple_data = stock_data["Returns"].dropna()

# 2. Demonstrate the effect of outliers on skewness
original_skewness = skew(apple_data)
print(f"Original Skewness: {original_skewness:.2f}")

# 3. Introduce an artificial outlier
apple_data_with_outlier = apple_data.copy()
apple_data_with_outlier.iloc[0] = 0.20  # Adding a large outlier
outlier_skewness = skew(apple_data_with_outlier)
print(f"Skewness with Outlier: {outlier_skewness:.2f}")

# 4. Demonstrate outlier capping using percentiles
lower_bound = np.percentile(apple_data, 5)
upper_bound = np.percentile(apple_data, 95)
capped_data = np.clip(apple_data, lower_bound, upper_bound)
capped_skewness = skew(capped_data)
print(f"Skewness after Capping: {capped_skewness:.2f}")
Original Skewness: 0.50
Skewness with Outlier: 4.75
Skewness after Capping: -0.19

This method limits extreme values within a specified range, thereby reducing their impact without complete removal. As Aguinis et al. (2013) suggest, outlier handling should be transparent and context-driven to ensure accurate interpretations of results (281).

4.4 Diagnosis:¶

  • Original Returns: The dataset contains noticeable outliers, primarily in the right tail, indicating unusually high positive returns.

  • Capped Returns: After applying the capping technique, the distribution appears more symmetrical, demonstrating reduced skewness and a more stable dataset (Aguinis et al. 284).

4.5 Damage:¶

  • Original Returns: The presence of outliers can distort statistical measures like the mean and standard deviation, potentially misleading financial analysis and predictions (Aguinis et al. 275).

  • Capped Returns: By capping extreme values, we mitigate the distortion while preserving the underlying distribution, leading to more robust financial models (Aguinis et al. 280).

4.7 Directions¶

To mitigate the impact of outliers, we can follow these steps:

4.7.1. Define Outliers

  • Error Outliers: Caused by data entry errors, misreporting, or technical glitches (e.g., erroneous stock prices) (Aguinis et al. 270).
  • Influential Outliers: Legitimate but extreme observations affecting model parameters (e.g., Black Swan events impacting VaR calculations) (Aguinis et al. 281).

4.7.2. Identify Outliers:

  • Visual Methods: Use time-series plots or box plots to detect anomalies (Aguinis et al. 276).

  • Statistical Tests:

    • Grubbs’ Test: Identify outliers in univariate data:
      $ G = \frac{\max |Y_i - \bar{Y}|}{s}$ where $Y_i$ is a data point, $\bar{Y}$ is the sample mean, and $s$ is the standard deviation (Grubbs 1).
    • Mahalanobis Distance: For multivariate outliers in portfolio returns:
      $D^2 = (\mathbf{x} - \mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{\mu})$ where $\mathbf{x}$ is the observation vector, $\mathbf{\mu}$ is the mean vector, and $\mathbf{\Sigma}$ is the covariance matrix (Aguinis et al. 277).
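Both test statistics are straightforward to compute. The sketch below plants one univariate outlier for Grubbs' statistic and flags multivariate points by Mahalanobis distance on synthetic data; all values are illustrative.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)

# --- Grubbs' statistic: one planted outlier in otherwise standard-normal data
y = np.append(rng.normal(0, 1, 99), 8.0)
G = np.max(np.abs(y - y.mean())) / y.std(ddof=1)
print(f"Grubbs G = {G:.2f}")  # compare against the critical value for n = 100

# --- Mahalanobis distance: squared distances for correlated bivariate data
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)
mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # D^2 per observation

# Under normality D^2 ~ chi-squared with d.o.f. = dimension; flag the top tail
flags = np.where(d2 > chi2.ppf(0.975, df=2))[0]
print(f"{len(flags)} of {len(X)} points flagged as multivariate outliers")
```

The Mahalanobis rule matters for portfolios because it respects correlation: a point can be unremarkable in each coordinate yet far from the cloud once covariance is accounted for.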

4.7.3. Handle Outliers:

  • Winsorization: Cap extreme values at the 1st and 99th percentiles to reduce skewness (Aguinis et al. 279).
  • Robust Models: Use quantile regression or M-estimation, which minimizes the influence of outliers (Rousseeuw and Leroy 2003).
  • Respecification: Adjust models by adding dummy variables for crisis periods or nonlinear terms (Aguinis et al. 289).

4.7.4 Transparency: Report analyses with and without outliers to assess robustness (Aguinis et al. 289).
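Winsorization from 4.7.3 maps directly onto scipy.stats.mstats.winsorize; the heavy-tailed Student-t draws below are an assumed stand-in for return data.

```python
import numpy as np
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(3)
r = rng.standard_t(df=3, size=1000) * 0.01  # heavy-tailed synthetic "returns"

# Cap the bottom 1% and top 1% of values at the corresponding percentiles
r_w = np.asarray(winsorize(r, limits=[0.01, 0.01]))

print(f"std    before: {r.std():.5f}   after: {r_w.std():.5f}")
print(f"max|r| before: {np.abs(r).max():.5f}   after: {np.abs(r_w).max():.5f}")
```

Unlike trimming, winsorization keeps the sample size intact, which matters when downstream models expect an unbroken time index.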

4.8 Outliers vs. Extreme Values:¶

  • Outliers: Deviate due to measurement errors or data entry mistakes. They should be corrected or removed (Aguinis et al. 283).
  • Extreme Values: Represent natural variability in the data and may be meaningful, requiring careful interpretation (Aguinis et al. 284).

To differentiate between outliers and extreme values, we use statistical techniques like:¶

  • Z-score method: Identifies points beyond ±3 standard deviations as potential outliers (Aguinis et al. 278).
  • Interquartile Range (IQR): Defines outliers as values below Q1 - 1.5IQR or above Q3 + 1.5IQR (Aguinis et al. 277).
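Both rules above take only a few lines of NumPy. The sketch plants two obvious outliers so each rule has something to flag; the indices and magnitudes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.append(rng.normal(0, 1, 200), [6.0, -7.0])  # two planted outliers

# Z-score rule: flag |z| > 3
z = (x - x.mean()) / x.std(ddof=1)
z_flags = np.where(np.abs(z) > 3)[0]

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flags = np.where((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr))[0]

print(f"Z-score flags: {z_flags}")
print(f"IQR flags:     {iqr_flags}")
```

The IQR rule is the more robust of the two: extreme points inflate the mean and standard deviation that the z-score depends on, which can mask the very outliers being hunted.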
In [489]:
# Create boxplots for original and capped returns
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
sns.boxplot(y=apple_data, color='purple')
plt.title('Original Returns')

plt.subplot(1, 2, 2)
sns.boxplot(y=capped_data, color='green')
plt.title('Capped Returns')

plt.show()

The boxplots clearly show that capping reduces the presence of extreme values, making the data more manageable for analysis (Aguinis et al. 281).

Conclusion:¶

  • Handling outliers appropriately is critical in financial analysis and statistical modeling.

  • The choice of method should align with the dataset's characteristics and the research objective.

  • As Aguinis et al. emphasize, transparency in defining, identifying, and handling outliers is essential for ensuring reliable and replicable results (281).

References¶

Aguinis, Herman, et al. “Best-Practice Recommendations for Defining, Identifying, and Handling Outliers.” Organizational Research Methods, vol. 16, no. 2, 2013, pp. 270–301.

Grubbs, Frank E. “Procedures for Detecting Outlying Observations in Samples.” Technometrics, vol. 11, no. 1, 1969, pp. 1–21.
Rousseeuw, Peter J., and Annick M. Leroy. Robust Regression and Outlier Detection. Wiley, 2003.

5 Joining Time Series with Different Frequencies¶

5.1 Definition¶

Joining time series of different frequencies involves merging datasets where observations are recorded at mismatched intervals, such as daily stock prices and quarterly earnings reports. Formally, let $Y_t$ represent a high-frequency series (e.g., daily data) and $X_s$ a low-frequency series (e.g., quarterly data). The alignment challenge is to reconcile $X_s$ with $Y_t$ such that:

$$Y_t = f(X_s) + \varepsilon_t,$$

where $f(\cdot)$ is a function that bridges the frequency gap, often through interpolation, aggregation, or state-space synchronization (Casals, Jerez, and Sotoca 5). State-space models formalize this alignment by decomposing time series into latent components governed by transition equations. For instance, a quarterly series $\mathbf{z}_t$ might follow:

$$\begin{aligned} \mathbf{x}_{t+1} &= \Phi \mathbf{x}_t + \Gamma \mathbf{u}_t + E \mathbf{a}_t \quad &\text{(State equation)} \\ \mathbf{z}_t &= H \mathbf{x}_t + D \mathbf{u}_t + \mathbf{a}_t \quad &\text{(Observation equation)}, \end{aligned}$$

where $\mathbf{x}_t$ represents unobserved states and $\mathbf{a}_t$ is noise (Casals, Jerez, and Sotoca 8). Aggregation matrices, such as summing quarterly values into annual totals, further complicate alignment by altering observability (Casals, Jerez, and Sotoca 10).

5.2 Description¶

Mismatched frequencies are pervasive in fields like economics (daily stock prices vs. quarterly GDP) and environmental science (hourly temperature vs. monthly $CO_2$ levels). Key challenges include:

a) Structural Misalignment: Direct merging creates missing values (e.g., monthly GDP repeated daily).
b) Loss of Information: Aggregation obscures high-frequency trends. For example, summing daily sales into monthly totals masks intra-month volatility.
c) Temporal Causality: Using future low-frequency data (e.g., annual earnings) to explain past high-frequency values introduces look-ahead bias (Fama 385).
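A toy pandas example makes points (a) and (c) concrete: a naive index join leaves NaNs on every unmatched day, and a forward fill propagates each low-frequency value only from its release date onward (a back fill would leak future data into the past). All dates and numbers here are illustrative.

```python
import numpy as np
import pandas as pd

# Ten daily observations and one "quarterly" figure released mid-sample
daily = pd.Series(np.arange(10.0),
                  index=pd.date_range("2024-01-01", periods=10, freq="D"),
                  name="price")
release = pd.Series([100.0], index=pd.to_datetime(["2024-01-05"]), name="earnings")

merged = pd.concat([daily, release], axis=1)
print(f"NaNs after naive join: {merged['earnings'].isna().sum()} of {len(merged)}")

# Forward-fill: the value is visible only on and after its release date
merged["earnings_ffill"] = merged["earnings"].ffill()
print(merged)
```

Days before the release stay NaN even after the fill, which is exactly the look-ahead-safe behavior wanted for causal analysis.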

5.3 Demonstration¶

Consider an analysis of daily sales data $Y_t$ and a monthly advertising budget $X_s$. A naive approach to merging would align $X_s$ only on month-end dates, leaving missing values for all other days. A common solution is forward-filling, where the monthly budget is carried forward to every day in that month. However, this assumes a constant daily impact of advertising, which may not reflect reality and can introduce biases in causal inference. To illustrate the same frequency mismatch with market data, we use AAPL stock data from yfinance, combining daily returns (high frequency) with annual returns (low frequency). This highlights the importance of properly handling time series with different frequencies to ensure accurate conclusions.

In [490]:
"""AAPL Stock Analysis (2010-2024) - Demonstration"""
!pip install yfinance --quiet
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# =============================================================================
# Data Acquisition
# =============================================================================
stock = yf.Ticker("AAPL")
data = stock.history(start='2010-01-01', end='2024-12-31')[['Close']]

# =============================================================================
# Returns Calculation
# =============================================================================
# Daily returns
data['Daily Return'] = data['Close'].pct_change() * 100

# Yearly returns calculation
def calculate_yearly_returns(df):
    years = df.index.year.unique()
    yearly_data = []
    for year in years:
        year_df = df[df.index.year == year]
        if not year_df.empty:
            start = year_df.iloc[0]['Close']
            end = year_df.iloc[-1]['Close']
            yearly_data.append((year_df.index[-1], ((end/start)-1)*100))
    return pd.DataFrame(yearly_data, columns=['Date', 'Return']).set_index('Date')

yearly_returns = calculate_yearly_returns(data)

data['yearly_returns']=yearly_returns
data.head(10)
Out[490]:
Close Daily Return yearly_returns
Date
2010-01-04 00:00:00-05:00 6.440333 NaN NaN
2010-01-05 00:00:00-05:00 6.451466 0.172867 NaN
2010-01-06 00:00:00-05:00 6.348847 -1.590626 NaN
2010-01-07 00:00:00-05:00 6.337108 -0.184911 NaN
2010-01-08 00:00:00-05:00 6.379241 0.664859 NaN
2010-01-11 00:00:00-05:00 6.322965 -0.882164 NaN
2010-01-12 00:00:00-05:00 6.251043 -1.137477 NaN
2010-01-13 00:00:00-05:00 6.339215 1.410516 NaN
2010-01-14 00:00:00-05:00 6.302501 -0.579151 NaN
2010-01-15 00:00:00-05:00 6.197174 -1.671203 NaN
In [491]:
# =============================================================================
# Returns Visualization
# =============================================================================
def plot_returns():
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))

    # Daily returns plot
    ax1.plot(data['Daily Return'], color='#1f77b4', linewidth=0.8)
    ax1.set_title('Daily Returns (2010-2024)', fontsize=14)
    ax1.set_ylabel('Return (%)', fontsize=12)
    ax1.grid(True, alpha=0.3)

    # Yearly returns plot
    colors = ['#2ca02c' if x>0 else '#d62728' for x in yearly_returns['Return']]
    ax2.scatter(yearly_returns.index, yearly_returns['Return'], c=colors, s=100)
    ax2.plot(yearly_returns.index, yearly_returns['Return'], color='gray', alpha=0.3)
    for date, ret in zip(yearly_returns.index, yearly_returns['Return']):
        ax2.text(date, ret, f'{ret:.1f}%', ha='center', va='bottom')
    ax2.set_title('Annual Returns', fontsize=14)
    ax2.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
    plt.tight_layout()
    plt.show()

plot_returns()

5.4. Diagnosis¶

Detect frequency mismatches via:

a) Metadata Checks: Compare series intervals (e.g., daily vs. quarterly).
b) Missing Values: High NaN counts post-merge indicate misalignment.
c) Visual Inspection: Abrupt changes in low-frequency data when plotted against high-frequency series.
d) Statistical Tests: Ljung-Box tests for residual autocorrelation after merging (Hyndman and Athanasopoulos 145).

In [492]:
# =============================================================================
# Model Diagnostics
# =============================================================================
from statsmodels.tsa.statespace.sarimax import SARIMAX

def run_diagnostics():
    model = SARIMAX(data['Close'], order=(1,1,1)).fit(disp=False)
    print("Model Summary:")
    print(model.summary())

    print("\nDiagnostic Plots:")
    model.plot_diagnostics(figsize=(10,8))
    plt.tight_layout()
    plt.show()

run_diagnostics()
Model Summary:
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                  Close   No. Observations:                 3773
Model:               SARIMAX(1, 1, 1)   Log Likelihood               -7079.711
Date:                Sun, 23 Mar 2025   AIC                          14165.422
Time:                        22:49:15   BIC                          14184.129
Sample:                             0   HQIC                         14172.073
                               - 3773                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.5421      0.254      2.138      0.033       0.045       1.039
ma.L1         -0.5623      0.251     -2.243      0.025      -1.054      -0.071
sigma2         2.4991      0.023    106.676      0.000       2.453       2.545
===================================================================================
Ljung-Box (L1) (Q):                   0.15   Jarque-Bera (JB):             17145.13
Prob(Q):                              0.70   Prob(JB):                         0.00
Heteroskedasticity (H):             119.88   Skew:                             0.11
Prob(H) (two-sided):                  0.00   Kurtosis:                        13.44
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Diagnostic Plots:

5.5 Damage¶

Ignoring frequency mismatches leads to:

a) Look-Ahead Bias: Future quarterly earnings influencing past daily stock returns (Fama 410).
b) Overstated Correlations: Artificial inflation from repeated low-frequency values.
c) Model Instability: Violations of equidistant interval assumptions in models like ARIMA (Hyndman and Athanasopoulos 72).
d) Resource Misallocation: Flawed business decisions due to incorrect causal links (e.g., misattributing sales spikes to advertising).

5.6 Directions¶

5.6.1 Temporal Aggregation/Disaggregation¶

a) Aggregation: Convert high-frequency data to low-frequency via averaging or summing. For example, daily sales become monthly averages.
b) Disaggregation: Use linear interpolation or splines to estimate high-frequency values from low-frequency data. However, this risks overfitting if the underlying process is nonlinear.
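Both directions map onto pandas resample. The synthetic daily series below is illustrative, using month-start anchors (the "MS" alias) for aggregation and linear interpolation for disaggregation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
daily = pd.Series(rng.normal(0.05, 1.0, 90),
                  index=pd.date_range("2024-01-01", periods=90, freq="D"))

# a) Aggregation: daily -> monthly means (labelled at each month start)
monthly = daily.resample("MS").mean()

# b) Disaggregation: monthly anchors -> daily values by linear interpolation
monthly_as_daily = monthly.resample("D").interpolate(method="linear")

print(monthly)
print(monthly_as_daily.head())
```

As the text warns, the interpolated daily path is an estimate: it reproduces the monthly anchors exactly but invents the intra-month shape, so it should not be mistaken for observed high-frequency data.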

5.6.2 Mixed-Frequency Models¶

MIDAS (Mixed Data Sampling) regression incorporates variables sampled at different frequencies without aggregation. For example, daily stock returns $ Y_t$ can be modeled as:

$Y_t = \beta_0 + \sum_{k=0}^K \beta_k X_{s-k} + \varepsilon_t$,

where $X_{s-k}$ represents lagged quarterly earnings (Ghysels, Santa-Clara, and Valkanov 63).

5.6.3 State-Space Models¶

Kalman filters in state-space frameworks estimate latent states that align both series. For example, a quarterly earnings series $X_s$ can be embedded into a daily state-space model:

$$\begin{aligned} \text{Daily state: } \mathbf{x}_t &= \Phi \mathbf{x}_{t-1} + \mathbf{w}_t \\ \text{Observation: } Y_t &= H \mathbf{x}_t + \varepsilon_t, \end{aligned} $$

where $\mathbf{x}_t$ captures latent factors influencing both daily prices and quarterly earnings (Casals, Jerez, and Sotoca 14).

In [493]:
# Aggregate daily returns to yearly mean
data['Year'] = data.index.year
yearly_avg_return = data.groupby('Year')['Daily Return'].mean()

# Merge with yearly returns. Note the index mismatch: yearly_returns is indexed
# by year-end timestamps while yearly_avg_return is indexed by integer years,
# so concat cannot align them and the merged column comes out all NaN -- the
# post-merge missing-value symptom flagged in the Diagnosis section (5.4)
combined_yearly = pd.concat([yearly_returns, yearly_avg_return], axis=1)
combined_yearly.columns = ['Yearly Return (%)', 'Avg Daily Return (%)']
print(combined_yearly.head())
                           Yearly Return (%)  Avg Daily Return (%)
2010-12-31 00:00:00-05:00          50.721916                   NaN
2011-12-30 00:00:00-05:00          22.887422                   NaN
2012-12-31 00:00:00-05:00          30.558623                   NaN
2013-12-31 00:00:00-05:00           4.750874                   NaN
2014-12-31 00:00:00-05:00          42.628382                   NaN
In [494]:
# Disaggregate yearly returns to daily frequency
yearly_returns_daily = yearly_returns.resample('D').ffill().reindex(data.index)
data_combined = data.join(yearly_returns_daily.rename(columns={'Return': 'Yearly Return'}))

# Plot results
plt.figure(figsize=(10, 4))
plt.plot(data_combined['Daily Return'], label='Daily Return', alpha=0.5)
plt.plot(data_combined['Yearly Return'], label='Yearly Return (FFill)', color='red')
plt.legend()
plt.show()
In [495]:
from sklearn.linear_model import LinearRegression

yearly_returns_lagged = yearly_returns.shift(1)
data_midas = data.join(yearly_returns_lagged.rename(columns={'Return': 'Lagged Yearly Return'}))


data_midas_clean = data_midas.dropna()
X = data_midas_clean[['Lagged Yearly Return']]
y = data_midas_clean['Daily Return']

model = LinearRegression()
model.fit(X, y)
print(f"MIDAS Coefficient: {model.coef_[0]:.4f}, Intercept: {model.intercept_:.4f}")
MIDAS Coefficient: -0.0024, Intercept: -0.0087
In [496]:
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

data['Year'] = data.index.year
yearly_returns['Year'] = yearly_returns.index.year

data_merged = data.merge(
    yearly_returns[['Year', 'Return']],
    on='Year',
    how='left',
    suffixes=('', '_Yearly')
)

data_merged['Return'] = data_merged['Return'].astype(float)


data_clean = data_merged[['Daily Return', 'Return']].dropna()

# Fit SARIMAX model (Kalman filter)
mod = SARIMAX(
    endog=data_clean['Daily Return'],
    exog=data_clean['Return'],
    order=(1, 0, 0)
)
res = mod.fit(disp=False)
print(res.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable:           Daily Return   No. Observations:                 3772
Model:               SARIMAX(1, 0, 0)   Log Likelihood               -7465.057
Date:                Sun, 23 Mar 2025   AIC                          14936.114
Time:                        22:49:17   BIC                          14954.820
Sample:                             0   HQIC                         14942.764
                               - 3772                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Return         0.0034      0.001      5.892      0.000       0.002       0.004
ar.L1         -0.0434      0.011     -3.996      0.000      -0.065      -0.022
sigma2         3.0657      0.040     76.954      0.000       2.988       3.144
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):              3951.29
Prob(Q):                              0.98   Prob(JB):                         0.00
Heteroskedasticity (H):               1.39   Skew:                            -0.07
Prob(H) (two-sided):                  0.00   Kurtosis:                         8.01
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [ ]:
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

fig = res.plot_diagnostics(figsize=(10, 6))
plt.suptitle('SARIMAX Model Diagnostics', y=1.02)
plt.tight_layout()
plt.show()

5.7 Conclusion¶

Joining time series with different frequencies (e.g., daily vs. quarterly) introduces structural misalignment, information loss, and look-ahead bias if handled naively. Risks include inflated correlations, model instability, and flawed causal inference. Solutions like MIDAS regression, temporal aggregation/disaggregation, and state-space models (e.g., Kalman filters) reconcile frequencies while preserving data integrity. Proper alignment mitigates biases and ensures accurate analysis in econometric and financial applications.

References¶

Casals, José, Miguel Jerez, and Sonia Sotoca. Modeling and Forecasting Time Series Sampled at Different Frequencies. European Commission, 2005.

Fama, Eugene F. “Efficient Capital Markets: A Review of Theory and Empirical Work.” The Journal of Finance, vol. 25, no. 2, 1970, pp. 383–417.

Ghysels, Eric, et al. “Predicting Volatility: Getting the Most Out of Return Data Sampled at Different Frequencies.” Journal of Econometrics, vol. 131, no. 1-2, 2006, pp. 59–95.

Hyndman, Rob J., and George Athanasopoulos. Forecasting: Principles and Practice. OTexts, 2018.

Wei, William W. S. “Some Consequences of Temporal Aggregation in Seasonal Time Series Models.” Seasonal Analysis of Economic Time Series, edited by Arnold Zellner, U.S. Bureau of the Census, 1978, pp. 433–448.

In [ ]:
import warnings
warnings.filterwarnings('ignore')
import sys
import os

class HiddenPrints:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

def ignore_exceptions(func):
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            pass
    return wrapper

!apt-get install texlive-xetex texlive-fonts-recommended texlive-plain-generic --quiet
!pip install nbconvert --quiet
!apt-get install pandoc --quiet

#!jupyter nbconvert --to pdf --no-input "/content/MFE_610_GWP1.ipynb"
!jupyter nbconvert --to html --no-input "/content/MFE_610_GWP1.ipynb"
from google.colab import files
#files.download("/content/MFE_610_GWP1.pdf")
files.download("/content/MFE_610_GWP1.html")