Introduction¶
The global crude oil market is characterized by complex structure and volatile non-linearities that cause predictions and inference in traditional econometric models to fail or be biased. Historically, approaches based on purely endogenous processes, such as Autoregressive Moving Average (ARMA) and Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models, fail to capture the multifaceted behavior of market prices because they depend exclusively on extrapolations of historical prices, ignoring the influence of exogenous factors (Alvi 2-3). As described by Alvi (4), oil dynamics are influenced by factors that include (geo)political instability, global demand and supply fundamentals, market sentiment, and others (Alvi 35-38). This project follows Alvi's proposal, which focuses on Probabilistic Graphical Models (PGMs) as a tool to mitigate the bias described above, and uses Bayesian Networks and Directed Acyclic Graphs (DAGs) to map causal dependencies and risk propagation between macroeconomic and microeconomic financial market indicators, aiming to identify latent volatility regimes and support better decisions in risk management.
Step 1)¶
This step consisted of reading the suggested article, which all group members completed.
Step 2 a)¶
The problem that Alvi attempts to solve lies in the inability of traditional econometric models to capture the multifaceted, non-linear behavior of oil prices. Historically, predictions in econometric models are based on endogenous processes such as ARMA,
$$ X_t = c + \epsilon_t + \sum_{i=1}^p \phi_i X_{t-i} + \sum_{j=1}^q \theta_j \epsilon_{t-j}, $$
where $c$ is the intercept, $\phi_i$ and $\theta_j$ are the autoregressive and moving-average coefficients that quantify the influence of the past, and $\epsilon_t$ is the stochastic error; and GARCH($p,q$),
$$ \sigma_t^2 = \omega + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2, $$
in which $\omega > 0$ defines a baseline variance, $\alpha_i$ the sensitivity to new shocks, and $\beta_j$ the persistence of volatility (Alvi 2-3). This approach is insufficient because it relies on extrapolations of historical data, ignoring that the price is also influenced by factors that include (geo)political instability, global demand and supply fundamentals, market sentiment, and others (Alvi 35-38). Alvi's dissertation improves on this approach by proposing the use of Probabilistic Graphical Models (PGMs), integrating macroeconomic and microeconomic data and using causal inference to predict multivariate market behavior in response to real economic events across multiple quarters.
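The two equations above can be made concrete with a short simulation. The sketch below (parameter values are illustrative choices of ours, not estimates from the data) generates an ARMA(1,1) path and a GARCH(1,1) variance process with NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_arma11(n, c=0.0, phi=0.6, theta=0.3, sigma=1.0):
    """Simulate X_t = c + phi*X_{t-1} + eps_t + theta*eps_{t-1}."""
    x = np.zeros(n)
    eps = rng.normal(0, sigma, n)
    for t in range(1, n):
        x[t] = c + phi * x[t - 1] + eps[t] + theta * eps[t - 1]
    return x

def simulate_garch11(n, omega=0.1, alpha=0.1, beta=0.8):
    """Simulate sigma_t^2 = omega + alpha*eps_{t-1}^2 + beta*sigma_{t-1}^2."""
    sigma2 = np.zeros(n)
    eps = np.zeros(n)
    sigma2[0] = omega / (1 - alpha - beta)  # start at the unconditional variance
    eps[0] = rng.normal(0, np.sqrt(sigma2[0]))
    for t in range(1, n):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
        eps[t] = rng.normal(0, np.sqrt(sigma2[t]))
    return eps, sigma2

prices_like = simulate_arma11(500)
returns, sigma2 = simulate_garch11(500)
print(prices_like.shape, sigma2.shape)
```

Both processes are driven only by their own past and random shocks, which is precisely the limitation the dissertation criticizes: no exogenous variable ever enters the recursion.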
Step 2 b)¶
Bayesian Networks are well suited for modelling the oil price due to their ability to decompose macroeconomic factors using Directed Acyclic Graphs (DAGs), $G = \langle V, E \rangle$, where the vertices $V$ represent random variables and the edges $E$ represent direct causal dependencies (Alvi 10). This approach allows us to encode prior knowledge (the prior) and factor the joint probability distribution, $p(x_1, \dots, x_D) = \prod_{i=1}^{D} p(x_i | pa(x_i))$, where $pa(x_i)$ are the parent variables that exert stochastic influence on the child node $x_i$ (Alvi 10-11). The approach is suitable for inferring unobserved causes ($H$) given evidence ($E$) via Bayes' theorem, $$P(H | E) = \frac{P(E | H) \cdot P(H)}{P(E)},$$ where $P(H|E)$ is the posterior probability, $P(E|H)$ the likelihood, $P(H)$ the prior, and $P(E)$ the marginal probability (Alvi 9). Additionally, the integration of Hidden Markov Models (HMMs) addresses latent volatility regimes via the Markov property, $P(X_{t+1} = j | X_t = i) = p_{ij}$, in which $p_{ij}$ is the transition probability between states $i$ and $j$ (Alvi 21-22). HMMs help to identify hidden states $q \in Q$ that influence market prices, capturing the nonlinear and hierarchical price behavior that traditional econometric models fail to capture (Alvi 22-23).
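To make the inference mechanics tangible, here is a minimal numerical sketch (all probabilities are illustrative values of ours, not taken from Alvi): Bayes' theorem applied to a toy hidden cause, and the long-run regime occupancy implied by a two-state Markov transition matrix:

```python
import numpy as np

# Prior over a hidden cause H (say, "supply disruption": yes/no) -- illustrative
p_H = np.array([0.2, 0.8])            # P(H=yes), P(H=no)
# Likelihood of the evidence E = "price spike" under each state of H
p_E_given_H = np.array([0.9, 0.1])    # P(E | H=yes), P(E | H=no)

# Bayes' theorem: P(H|E) = P(E|H) P(H) / P(E), with P(E) the normalizer
joint = p_E_given_H * p_H
posterior = joint / joint.sum()
print(posterior)  # observing the spike raises P(disruption) from 0.2 to ~0.69

# Markov property: P(X_{t+1}=j | X_t=i) = p_ij for two volatility regimes
P = np.array([[0.95, 0.05],   # calm -> calm / volatile
              [0.20, 0.80]])  # volatile -> calm / volatile
# Stationary distribution = left eigenvector of P for eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
stat = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stat = stat / stat.sum()
print(stat)  # long-run share of time spent in each regime
```

The same two operations, conditioning on evidence and propagating regime transitions, are what the full Bayesian Network / HMM pipeline performs at scale.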
Step 2c)¶
The use of Probabilistic Graphical Models (PGMs) outperforms linear regression by allowing structure learning, where algorithms such as Hill Climbing extract causality from raw data and adapt to dynamic changes (Alvi 14-17). The methodology mitigates the curse of dimensionality (computational cost) by factoring the joint distribution into local ones, making it well suited for big data (Alvi 13). To ensure parsimony, it uses the Bayesian Information Criterion, $$BIC(G, D) = \sum_{v \in V} \log \hat{p}(x_v | pa(x_v)) - \frac{1}{2} k_{v|pa(v)} \log n,$$ where $\sum \log \hat{p}$ is the sum of the local log-likelihoods of each node $v$ conditional on its parents $pa(x_v)$, $k_{v|pa(v)}$ is the number of independent parameters (degrees of freedom), and $n$ is the number of observations (samples) (Alvi 18-19). This penalizes excessive complexity, preventing overfitting by retaining only dependencies with real predictive power. Finally, PGMs help to identify macroeconomic vulnerabilities that traditional linear models fail to detect.
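As a sketch of how this score guides structure learning, the BIC of a single node can be computed as its local log-likelihood minus the parameter penalty. The example below uses toy data and a linear-Gaussian node model (our assumption for illustration, not Alvi's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_node_bic(y, X):
    """Local BIC of one node: log-likelihood under a linear-Gaussian CPD
    minus (k/2) * log(n), where k counts the free parameters
    (intercept + coefficients + noise variance)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X]) if X is not None else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / n          # MLE of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = A.shape[1] + 1                  # coefficients + variance
    return loglik - 0.5 * k * np.log(n)

# Toy data: the child genuinely depends on parent1 only
parent1 = rng.normal(size=200)
parent2 = rng.normal(size=200)
child = 2.0 * parent1 + rng.normal(scale=0.5, size=200)

bic_no_parent = gaussian_node_bic(child, None)
bic_true = gaussian_node_bic(child, parent1[:, None])
bic_extra = gaussian_node_bic(child, np.column_stack([parent1, parent2]))
print(round(bic_no_parent, 1), round(bic_true, 1), round(bic_extra, 1))
```

A Hill Climbing search would compare exactly these scores: adding the true edge raises the node's BIC sharply, while a spurious second parent buys almost no likelihood and pays the $\frac{1}{2}\log n$ penalty per extra parameter.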
Step 3 a)¶
# 0. PACKAGE INSTALLATION
%pip install --quiet pandas numpy matplotlib seaborn scipy scikit-learn missingno yfinance pandas-datareader requests pgmpy
import warnings
import numpy as np
import pandas as pd
import scipy.stats as stats
from scipy.stats import median_abs_deviation
import requests
import yfinance as yf
import pandas_datareader.data as web
from sklearn.preprocessing import StandardScaler
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import os
import pgmpy
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-paper')
import pandas as pd
import pandas_datareader.data as web
import requests
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
START_DATE = "2015-01-01"
END_DATE = "2023-12-31"
FRED_API_KEY = "6071482a4efea6801348a6fd83196365"
EIA_API_KEY = "JSHoiaSh9e3PUtozc9UwHoRrSpa9xTniUZWSUPfa"
fred_series = {
"US_Ind_Production": "INDPRO",
"US_Energy_CPI": "CPIENGSL",
"Global_Econ_Policy_Uncert": "GEPUCURRENT"
}
df_macro = pd.DataFrame()
for name, ticker in fred_series.items():
df_macro[name] = web.DataReader(ticker, "fred", START_DATE, END_DATE)
def fetch_eia_data(series_id, name):
url = f"https://api.eia.gov/v2/seriesid/{series_id}?api_key={EIA_API_KEY}"
try:
response = requests.get(url).json()
dates = [row['period'] for row in response['response']['data']]
values = [row['value'] for row in response['response']['data']]
df = pd.DataFrame({name: values}, index=pd.to_datetime(dates))
return df.sort_index()
except Exception as e:
return pd.DataFrame()
eia_series = {
"OPEC_Production": "STEO.PAPR_OPEC.M",
"NonOPEC_Production": "STEO.PAPR_NONOPEC.M",
"OPEC_Spare_Capacity": "STEO.COPS_OPEC.M",
"OECD_GDP": "STEO.RGDPQ_OECD.M"
}
df_geo = pd.DataFrame()
for name, series_id in eia_series.items():
eia_data = fetch_eia_data(series_id, name)
if not eia_data.empty:
df_geo = pd.concat([df_geo, eia_data], axis=1)
df_student_A = pd.concat([df_macro, df_geo], axis=1).resample('M').last()
df_student_A = df_student_A.loc[START_DATE:END_DATE]
df_student_A.interpolate(method='linear', inplace=True)
output_filename = "student_A_macro_geo_data.csv"
df_student_A.to_csv(output_filename)
df_student_A.plot(subplots=True, figsize=(12, 10), title="Figure 1: Macro & Geopolitical Variables")
plt.tight_layout()
plt.show()
Industrial production and oil extraction (OPEC/Non-OPEC) suffered disruptions in 2020, with Non-OPEC production showing more pronounced growth toward the end of the sample, which runs through December 2023 (Figure 1). Global economic policy uncertainty reached high levels during the pandemic, accompanying a regime change (jump) in OPEC spare capacity, which peaked above 8 units before declining gradually (Figure 1). The Energy CPI shows a transition from momentary deflation in 2020 to persistent, accelerating inflation, ending with high energy costs between 2022 and 2023 (Figure 1).
Step 3 b)¶
micro_tickers = {
"Exxon_Mobil_Corp": "XOM",
"Oil_Services_ETF": "OIH",
"Delta_Airlines": "DAL"
}
df_corporate = pd.DataFrame()
for name, ticker in micro_tickers.items():
df_corporate[name] = yf.download(ticker, start=START_DATE, end=END_DATE)["Close"]
df_corporate = df_corporate.resample('M').last()
fred_micro = {"US_Vehicle_Sales": "TOTALSA"}
df_consumer = pd.DataFrame()
for name, ticker in fred_micro.items():
df_consumer[name] = web.DataReader(ticker, "fred", START_DATE, END_DATE)
df_consumer = df_consumer.resample('M').last()
df_student_B = pd.concat([df_corporate, df_consumer], axis=1)
df_student_B.interpolate(method='linear', inplace=True)
df_student_B.plot(subplots=True, figsize=(12, 8), title="Figure 2: Microeconomic Variables")
plt.tight_layout()
plt.show()
The microeconomic variables show the shocks of 2020, with falls in Exxon Mobil, the aviation sector (Delta Airlines), and US vehicle consumption (Figure 2). Exxon Mobil shows an exceptional recovery starting in 2021, while the oil services ETF and Delta Airlines show a more volatile recovery (Figure 2). The vehicle sales plot shows a V-shape during the lockdown period, serving as an important indicator of the paralysis in demand for durable goods (Figure 2).
Step 3 c)¶
financial_tickers = {
"Crude_Oil_WTI": "CL=F",
"SP500_Index": "^GSPC",
"USD_Index": "DX-Y.NYB",
"Gold": "GC=F",
"Volatility_Index_VIX": "^VIX",
"High_Yield_Bonds_HYG": "HYG"
}
df_student_C = pd.DataFrame()
for name, ticker in financial_tickers.items():
df_student_C[name] = yf.download(ticker, start=START_DATE, end=END_DATE)["Close"]
df_student_C = df_student_C.resample('M').last()
df_student_C.interpolate(method='linear', inplace=True)
df_student_C.plot(subplots=True, figsize=(12, 12), title="Figure 3: Financial & Securities Variables")
plt.tight_layout()
plt.show()
The time series show the systemic shocks of 2020, characterized by the collapse of WTI and the S&P 500 and the explosion of the VIX to levels above 50 (Figure 3). A recovery of risk assets and safe-haven assets such as gold is visible after 2020, followed by a sharp strengthening of the USD Index from mid-2021 onwards (Figure 3). The evolution of high-yield bonds and oil reaffirms cyclical financial patterns in which stable growth periods are interrupted by structural breaks that redefine asset price levels.
Step 4¶
Student A focused on global supply and demand as well as systemic stability. The aim is to capture information about production capacity and economic growth.
Industrial production (INDPRO): used as a high-frequency proxy for real economic growth. The industrial sector is the principal consumer of distillates, making this variable a good indicator of shifts in the demand curve.
Energy inflation (CPIENGSL): captures the pressure of energy prices on the consumer basket, which influences inflation expectations and monetary policy decisions.
Economic policy uncertainty (GEPU): quantifies geopolitical risk, since uncertainty shocks raise the risk premium on oil.
OPEC dynamics: OPEC acts as the swing producer. Production and spare capacity are important geopolitical variables that determine resilience to sudden supply disruptions.
def generate_data_dictionary(df, source_mapping):
"""
Generates a comprehensive metadata table for the given DataFrame.
Reflects statistical rigor by extracting actual frequency, start/end dates,
and missingness profile before the cleaning phase.
"""
# Identify inferred frequency or default to 'M' (Month End) based on prior resampling
freq = df.index.inferred_freq if df.index.inferred_freq else "M"
summary = pd.DataFrame({
"Variable": df.columns,
"Source": [source_mapping.get(col, "Unknown") for col in df.columns],
"Frequency": freq,
"Start_Date": df.index.min().strftime('%Y-%m-%d'),
"End_Date": df.index.max().strftime('%Y-%m-%d'),
"Observations": df.count().values,
"Missing_Values": df.isna().sum().values,
"Mean": df.mean().round(2).values,
"Std_Dev": df.std().round(2).values
})
return summary
source_map_A = {
"US_Ind_Production": "FRED",
"US_Energy_CPI": "FRED",
"Global_Econ_Policy_Uncert": "FRED",
"OPEC_Production": "EIA STEO",
"NonOPEC_Production": "EIA STEO",
"OPEC_Spare_Capacity": "EIA STEO",
"OECD_GDP": "EIA STEO"
}
dict_student_A = generate_data_dictionary(df_student_A, source_map_A)
display(dict_student_A)
print(f"Total Macro/Geo Dimensions: {df_student_A.shape}")
print(f"Overall Data Completeness: {((df_student_A.notna().sum().sum() / df_student_A.size) * 100):.2f}%\n")
| Variable | Source | Frequency | Start_Date | End_Date | Observations | Missing_Values | Mean | Std_Dev | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | US_Ind_Production | FRED | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 100.16 | 2.89 |
| 1 | US_Energy_CPI | FRED | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 227.66 | 37.77 |
| 2 | Global_Econ_Policy_Uncert | FRED | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 208.05 | 64.68 |
| 3 | OPEC_Production | EIA STEO | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 33.39 | 2.25 |
| 4 | NonOPEC_Production | EIA STEO | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 65.16 | 2.72 |
| 5 | OPEC_Spare_Capacity | EIA STEO | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 2.78 | 1.63 |
Total Macro/Geo Dimensions: (108, 6) Overall Data Completeness: 100.00%
Student B focused on corporate decisions, marginal costs, and consumer behavior, which are important for supply and demand.
Exxon Mobil (XOM): acts as a proxy for the capital expenditure (CAPEX) of major oil companies.
Oil services (OIH): this ETF reflects drilling activity and extraction costs. It is an important indicator of shale oil viability, whose marginal cost defines the price ceiling during regime changes.
Delta Airlines (DAL): represents the corporate consumer side. Airline profitability is inversely correlated with fuel costs, a measure of demand elasticity.
Vehicle sales (TOTALSA): acts as a measure of individual consumer behavior and gasoline demand in the world's largest consumer market.
source_map_B = {
"Exxon_Mobil_Corp": "Yahoo Finance",
"Oil_Services_ETF": "Yahoo Finance",
"Delta_Airlines": "Yahoo Finance",
"US_Vehicle_Sales": "FRED"
}
dict_student_B = generate_data_dictionary(df_student_B, source_map_B)
display(dict_student_B)
print(f"Total Micro Dimensions: {df_student_B.shape}")
print(f"Overall Data Completeness: {((df_student_B.notna().sum().sum() / df_student_B.size) * 100):.2f}%\n")
| Variable | Source | Frequency | Start_Date | End_Date | Observations | Missing_Values | Mean | Std_Dev | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Exxon_Mobil_Corp | Yahoo Finance | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 59.64 | 19.15 |
| 1 | Oil_Services_ETF | Yahoo Finance | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 332.68 | 145.97 |
| 2 | Delta_Airlines | Yahoo Finance | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 41.44 | 8.00 |
| 3 | US_Vehicle_Sales | FRED | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 16.56 | 1.77 |
Total Micro Dimensions: (108, 4) Overall Data Completeness: 100.00%
Student C analyses the interactions between oil, other assets, and the credit market.
WTI Crude Oil (CL=F): the dependent variable and principal reference contract for price discovery.
S&P 500 (^GSPC): captures global risk appetite and overall market health, influencing investment flows into commodities.
Dollar Index (DX-Y.NYB): as oil is priced in dollars, there is a historical inverse correlation.
VIX and High Yield Bonds (HYG): the VIX measures market fear, while HYG serves as a proxy for credit availability.
source_map_C = {
"Crude_Oil_WTI": "Yahoo Finance",
"SP500_Index": "Yahoo Finance",
"USD_Index": "Yahoo Finance",
"Gold": "Yahoo Finance",
"Volatility_Index_VIX": "Yahoo Finance",
"High_Yield_Bonds_HYG": "Yahoo Finance"
}
dict_student_C = generate_data_dictionary(df_student_C, source_map_C)
print("--- STUDENT C: FINANCIAL METADATA ---")
display(dict_student_C)
print(f"Total Financial Dimensions: {df_student_C.shape}")
print(f"Overall Data Completeness: {((df_student_C.notna().sum().sum() / df_student_C.size) * 100):.2f}%\n")
--- STUDENT C: FINANCIAL METADATA ---
| Variable | Source | Frequency | Start_Date | End_Date | Observations | Missing_Values | Mean | Std_Dev | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Crude_Oil_WTI | Yahoo Finance | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 60.61 | 18.07 |
| 1 | SP500_Index | Yahoo Finance | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 3132.82 | 878.20 |
| 2 | USD_Index | Yahoo Finance | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 97.42 | 4.63 |
| 3 | Gold | Yahoo Finance | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 1516.54 | 301.60 |
| 4 | Volatility_Index_VIX | Yahoo Finance | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 18.81 | 7.30 |
| 5 | High_Yield_Bonds_HYG | Yahoo Finance | ME | 2015-01-31 | 2023-12-31 | 108 | 0 | 58.46 | 5.97 |
Total Financial Dimensions: (108, 6) Overall Data Completeness: 100.00%
print("Dictionary of selected variables")
| Category | Variable | Ticker / ID | Source | Frequency | Unit | Transformation |
|---|---|---|---|---|---|---|
| Macro/Geo | Industrial production | INDPRO | FRED | Monthly | Index | % Change |
| Macro/Geo | Energy CPI | CPIENGSL | FRED | Monthly | Index | % Change |
| Macro/Geo | Geopolitical uncertainty | GEPUCURRENT | FRED | Monthly | Index | Level |
| Macro/Geo | OPEC production | PAPR_OPEC | EIA | Monthly | mb/d | Change |
| Macro/Geo | Non-OPEC production | PAPR_NONOPEC | EIA | Monthly | mb/d | Change |
| Macro/Geo | OPEC spare capacity | COPS_OPEC | EIA | Monthly | mb/d | Level |
| Macro/Geo | OECD GDP | RGDPQ_OECD | EIA | Monthly | Billions $ | % Change |
| Micro | Exxon Mobil | XOM | Yahoo Fin. | Monthly | USD | Log-Return |
| Micro | Oil services | OIH | Yahoo Fin. | Monthly | USD | Log-Return |
| Micro | Vehicle sales | TOTALSA | FRED | Monthly | Millions | % Change |
| Micro | Delta Airlines | DAL | Yahoo Fin. | Monthly | USD | Log-Return |
| Financial | WTI Crude Oil | CL=F | Yahoo Fin. | Monthly | USD/bbl | Log-Return |
| Financial | S&P 500 | ^GSPC | Yahoo Fin. | Monthly | Index | Log-Return |
| Financial | Dollar Index | DX-Y.NYB | Yahoo Fin. | Monthly | Index | % Change |
| Financial | Gold | GC=F | Yahoo Fin. | Monthly | USD/oz | Log-Return |
| Financial | VIX Index | ^VIX | Yahoo Fin. | Monthly | % | Level |
| Financial | High Yield Bonds | HYG | Yahoo Fin. | Monthly | USD | % Change |
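The transformations in the last column can be applied uniformly before modelling. A minimal pandas sketch (the toy values and the column grouping are ours for illustration; column names follow the DataFrames built above):

```python
import numpy as np
import pandas as pd

# Toy monthly frame standing in for the merged dataset (values are illustrative)
idx = pd.to_datetime(["2015-01-31", "2015-02-28", "2015-03-31",
                      "2015-04-30", "2015-05-31", "2015-06-30"])
df = pd.DataFrame({
    "Crude_Oil_WTI": [48.2, 49.8, 47.6, 59.6, 60.3, 49.6],
    "US_Ind_Production": [102.9, 102.2, 101.9, 101.3, 100.9, 100.5],
    "Volatility_Index_VIX": [21.0, 13.3, 15.3, 14.6, 13.8, 16.1],
}, index=idx)

log_return_cols = ["Crude_Oil_WTI"]        # per the dictionary: Log-Return
pct_change_cols = ["US_Ind_Production"]    # per the dictionary: % Change
level_cols = ["Volatility_Index_VIX"]      # per the dictionary: Level

transformed = pd.concat([
    np.log(df[log_return_cols]).diff(),    # log-return: log(P_t) - log(P_{t-1})
    df[pct_change_cols].pct_change(),      # simple percentage change
    df[level_cols],                        # kept in levels
], axis=1).dropna()                        # first row lost to differencing
print(transformed.shape)  # -> (5, 3)
```

Keeping the transformation map in explicit lists like this makes it easy to audit which variables enter the network in levels versus differenced form.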
Step 5a)¶
Here we focus on identifying outliers and heavy tails. For the univariate analysis of WTI, we applied a Hampel filter (rolling window of 6 months) via the Median Absolute Deviation, $MAD = \text{median}(|X_t - \tilde{X}|)$, where $\tilde{X}$ is the rolling median. Anomalies ($Z_{robust} > 3$) are flagged using the robust Z-score, $Z_{robust} = \frac{|X_t - \tilde{X}|}{k \cdot MAD}$, where $k \approx 1.4826$ is the scale factor for normality. On the multivariate side, we used the Minimum Covariance Determinant (MCD) to estimate the mean vector $\hat{\mu}_{MCD}$ and covariance matrix $\hat{\Sigma}_{MCD}$, allowing us to compute the robust Mahalanobis distance, $D_M(x_t) = \sqrt{(x_t - \hat{\mu}_{MCD})^T \hat{\Sigma}_{MCD}^{-1} (x_t - \hat{\mu}_{MCD})}$. Using an elliptic envelope (5% contamination), we isolated the systemic shocks, which, although extreme, are not removed from the database because they capture market asymmetries essential for modelling regime changes in Bayesian Networks and Hidden Markov Models.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import yfinance as yf # Required for tickers used in Student B/C
from scipy.stats import median_abs_deviation
from sklearn.covariance import EllipticEnvelope
df_master = pd.concat([df_student_A, df_student_B, df_student_C], axis=1).dropna()
def detect_univariate_outliers_mad(series, window=6, threshold=3.0):
rolling_median = series.rolling(window=window, center=True).median()
rolling_mad = series.rolling(window=window, center=True).apply(
lambda x: median_abs_deviation(x[~np.isnan(x)], scale='normal') if len(x[~np.isnan(x)]) > 0 else 0
)
robust_z_score = np.abs(series - rolling_median) / (rolling_mad + 1e-8)
outliers = robust_z_score > threshold
return outliers, rolling_median
def detect_multivariate_outliers_mcd(df, contamination=0.05):
df_clean = df.dropna()
mcd = EllipticEnvelope(contamination=contamination, random_state=42)
mcd.fit(df_clean)
preds = mcd.predict(df_clean)
outlier_dates = df_clean.index[preds == -1]
return outlier_dates
target_var = 'Crude_Oil_WTI'
multivariate_outliers = detect_multivariate_outliers_mcd(df_master)
if target_var in df_master.columns:
outlier_mask, trend = detect_univariate_outliers_mad(df_master[target_var])
fig, ax = plt.subplots(figsize=(14, 7))
ax.plot(df_master.index, df_master[target_var], label='WTI Price', color='steelblue', alpha=0.5, lw=2)
ax.plot(df_master.index, trend, label='Rolling Median (Trend)', color='darkorange', linestyle='--', alpha=0.8)
if outlier_mask.any():
ax.scatter(df_master.index[outlier_mask], df_master[target_var][outlier_mask],
color='crimson', label='Univariate Outliers (MAD)', zorder=5, marker='x', s=120)
if len(multivariate_outliers) > 0:
ax.scatter(multivariate_outliers, df_master.loc[multivariate_outliers, target_var],
edgecolor='blue', facecolor='none', s=200, label='Multivariate Outliers (MCD)',
zorder=4, linewidth=2)
ax.set_title(f"Figure 4: Comparative Outlier Detection (Univariate vs. Multivariate) - {target_var}")
ax.set_ylabel("Price / Value")
ax.legend(loc='best', frameon=True)
plt.grid(True, alpha=0.3)
plt.show()
print(f"Master DataFrame dimensions: {df_master.shape}")
print(f"Dates identified as Multivariate Outliers:\n{multivariate_outliers.strftime('%Y-%m-%d').tolist()}")
Master DataFrame dimensions: (108, 16) Dates identified as Multivariate Outliers: ['2020-03-31', '2020-04-30', '2020-05-31', '2020-06-30', '2020-10-31', '2021-02-28']
Detection via MCD identifies multivariate anomalies concentrated in 2020 and the beginning of 2021, where prices deviate most from the rolling median (Figure 4). These outliers represent structural disruptions and moments of systemic stress that deviate from the oil market's standard historical behavior.
Step 5b)¶
Here we screen out spurious observations along three vectors: (1) temporal integrity, eliminating duplicate timestamps to avoid look-ahead bias; (2) detection of stale prices, where liquidity fails, flagged when the standard deviation over a rolling 3-month window ($w=3$) is null, $$\sigma_{t, w} = \sqrt{\frac{1}{w-1} \sum_{i=0}^{w-1} (X_{t-i} - \bar{X}_t)^2} = 0;$$ and (3) domain-restriction validation, which imposes non-negativity, $X_{i,t} \ge 0, \forall t$, on macro- and microeconomic variables. The model makes an exception for WTI futures in April 2020, recognising negative prices as a market anomaly rather than a technical error. To preserve temporal continuity and the covariance matrix, anomalous observations were not excluded but converted to NaN values ($X_t \to \text{NA}$), so that Student C can impute them (see Step 5c).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
def audit_data_integrity(df):
report = {}
df = df.sort_index()
dup_idx = df.index.duplicated().sum()
report['Duplicated_Timestamps'] = dup_idx
stale_data_mask = (df.diff().abs().rolling(window=2).sum() == 0)
stale_counts = stale_data_mask.sum()
report['Stale_Data_Points'] = stale_counts[stale_counts > 0].to_dict()
positive_patterns = ['INDPRO', 'PAPR', 'TOTALSA', 'RGDP', 'CPI']
neg_violations = {}
for col in df.columns:
if "Crude_Oil_WTI" in col:
continue
if any(p in col for p in positive_patterns):
neg_count = (df[col] < 0).sum()
if neg_count > 0:
neg_violations[col] = neg_count
report['Negative_Value_Violations'] = neg_violations
return report, stale_data_mask
integrity_report, stale_mask = audit_data_integrity(df_master)
df_master = df_master[~df_master.index.duplicated(keep='last')].sort_index()
for col, count in integrity_report['Stale_Data_Points'].items():
df_master.loc[stale_mask[col], col] = np.nan
for key, val in integrity_report.items():
print(f"{key}: {val}")
cols_with_stale = [col for col in df_master.columns if stale_mask[col].any()]
if cols_with_stale:
fig, ax = plt.subplots(figsize=(14, 4))
sns.heatmap(stale_mask[cols_with_stale].T, cmap='YlOrRd', cbar=False, ax=ax)
ax.set_title("Figure 5: Heatmap of Stale Data Points (Flagged for Imputation)")
plt.show()
Duplicated_Timestamps: 0
Stale_Data_Points: {}
Negative_Value_Violations: {}
Step 5c)¶
Here we avoided backward fill (bfill) so as not to violate the Markov property, $P(X_{t+1} | X_t, X_{t-1}, \dots) = P(X_{t+1} | X_t)$, which could invalidate causality in the Bayesian Networks. We used the Piecewise Cubic Hermite Interpolating Polynomial (PCHIP) for short gaps, preserving local monotonicity, and Multiple Imputation by Chained Equations (MICE) with Bayesian ridge regression for structural omissions. The model estimates the parameters $\theta$ using Bayes' theorem, $p(\theta | y, X) \propto p(y | X, \theta) p(\theta)$, assuming Gaussian (Normal) prior distributions so that the imputed values do not distort the global covariance matrix or the joint probability distribution.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import matplotlib.gridspec as gridspec
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
import warnings
warnings.filterwarnings('ignore')
fig = plt.figure(figsize=(16, 7))
gs = gridspec.GridSpec(1, 2, width_ratios=[1.2, 1])
ax0 = plt.subplot(gs[0])
msno.bar(df_master, ax=ax0, color="slategray", fontsize=10, labels=True)
ax0.set_title("A. Feature Completeness Analysis (%)", loc='left', fontsize=14, fontweight='bold')
ax1 = plt.subplot(gs[1])
msno.dendrogram(df_master, ax=ax1, orientation='right', fontsize=10)
ax1.set_title("B. Nullity Hierarchy (Correlation Clusters)", loc='left', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
def impute_financial_data(df):
df_imputed = df.copy()
df_imputed = df_imputed.interpolate(method='pchip', limit=2, limit_direction='both')
if df_imputed.isna().sum().sum() > 0:
imputer = IterativeImputer(
estimator=BayesianRidge(),
max_iter=30,
random_state=42,
tol=1e-3
)
time_idx = np.arange(len(df_imputed)).reshape(-1, 1)
imputation_matrix = np.hstack([time_idx, df_imputed.values])
imputed_array = imputer.fit_transform(imputation_matrix)
df_imputed = pd.DataFrame(imputed_array[:, 1:], index=df.index, columns=df.columns)
pos_cols = [c for c in df_imputed.columns if any(sub in c for sub in ['Production', 'Capacity', 'Sales', 'GDP'])]
df_imputed[pos_cols] = df_imputed[pos_cols].clip(lower=0)
return df_imputed
df_clean_final = impute_financial_data(df_master)
sample_col = 'OPEC_Spare_Capacity'
if sample_col in df_master.columns:
fig, ax = plt.subplots(figsize=(14, 5))
ax.plot(df_master.index, df_master[sample_col], color='steelblue',
linewidth=2.5, label='Original Observed Data', alpha=0.5)
ax.plot(df_clean_final.index, df_clean_final[sample_col], color='crimson',
linestyle='--', linewidth=1.5, label='Final Imputed Trajectory (MICE/PCHIP)')
null_mask = df_master[sample_col].isna()
if null_mask.any():
ax.scatter(df_clean_final.index[null_mask], df_clean_final.loc[null_mask, sample_col],
color='gold', zorder=5, label='Imputed Points (Validation)', s=40)
ax.set_title(f"Figure 7: Imputation Integrity Check - {sample_col}", fontsize=14, fontweight='bold')
ax.set_ylabel("Standardized Value")
ax.legend(frameon=True, loc='best')
plt.grid(True, alpha=0.3)
plt.show()
print(f"Post-Imputation Audit: {df_clean_final.isna().sum().sum()} missing values remaining.")
Post-Imputation Audit: 0 missing values remaining.
As shown in Panels A and B, the dataset achieved 100% completeness across the 108 observations. Figure 7 shows that the imputed OPEC data overlap the original data, preserving consistency.
Step 6¶
The sterilized dataset ensures the structural integrity required by the Bayesian Network.
def sterilize_dataset(df: pd.DataFrame, target_var: str = 'Crude_Oil_WTI') -> pd.DataFrame:
"""
Consolidates the data sterilization protocol combining structural integrity checks,
hybrid Bayesian imputation, and robust outlier winsorization.
Optimized for Markovian and Bayesian Network econometric modeling.
"""
df_clean = df.copy()
df_clean = df_clean[~df_clean.index.duplicated(keep='last')]
stale_mask = df_clean.rolling(window=3).std() == 0
df_clean[stale_mask] = np.nan
strictly_positive_vars = ['INDPRO', 'OPEC_Production', 'NonOPEC_Production', 'OECD_GDP', 'US_Vehicle_Sales']
for col in df_clean.columns:
if any(var in col for var in strictly_positive_vars):
df_clean.loc[df_clean[col] < 0, col] = np.nan
predictors = [col for col in df_clean.columns if col != target_var]
for col in predictors:
rolling_median = df_clean[col].rolling(window=12, center=True, min_periods=1).median()
rolling_mad = df_clean[col].rolling(window=12, center=True, min_periods=1).apply(
lambda x: median_abs_deviation(x[~np.isnan(x)], scale='normal')
)
epsilon = 1e-8
upper_bound = rolling_median + (5 * (rolling_mad + epsilon))
lower_bound = rolling_median - (5 * (rolling_mad + epsilon))
df_clean[col] = np.where(df_clean[col] > upper_bound, upper_bound, df_clean[col])
df_clean[col] = np.where(df_clean[col] < lower_bound, lower_bound, df_clean[col])
df_clean = df_clean.interpolate(method='pchip', limit=2, limit_direction='forward')
if df_clean.isna().sum().sum() > 0:
imputer = IterativeImputer(
estimator=BayesianRidge(),
max_iter=20,
random_state=42,
tol=1e-3
)
time_idx = np.arange(len(df_clean)).reshape(-1, 1)
imp_matrix = np.hstack([time_idx, df_clean.values])
imputed_vals = imputer.fit_transform(imp_matrix)
df_clean = pd.DataFrame(imputed_vals[:, 1:], index=df_clean.index, columns=df_clean.columns)
return df_clean
try:
df_master = pd.concat([df_student_A, df_student_B, df_student_C], axis=1)
df_sterilized = sterilize_dataset(df_master, target_var='Crude_Oil_WTI')
print("--- STERILIZATION PIPELINE COMPLETED ---")
print(f"Original Dimensions: {df_master.shape}")
print(f"Sterilized Dimensions: {df_sterilized.shape}")
print(f"Remaining NaNs: {df_sterilized.isna().sum().sum()}\n")
print("--- STERILIZED DATASET PREVIEW (HEAD & TAIL) ---")
df_preview = pd.concat([df_sterilized.head(5), df_sterilized.tail(5)])
display(df_preview)
except NameError:
print("Error: df_student_A, df_student_B, or df_student_C not found in the current namespace.")
--- STERILIZATION PIPELINE COMPLETED --- Original Dimensions: (108, 16) Sterilized Dimensions: (108, 16) Remaining NaNs: 0 --- STERILIZED DATASET PREVIEW (HEAD & TAIL) ---
| US_Ind_Production | US_Energy_CPI | Global_Econ_Policy_Uncert | OPEC_Production | NonOPEC_Production | OPEC_Spare_Capacity | Exxon_Mobil_Corp | Oil_Services_ETF | Delta_Airlines | US_Vehicle_Sales | Crude_Oil_WTI | SP500_Index | USD_Index | Gold | Volatility_Index_VIX | High_Yield_Bonds_HYG | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-01-31 | 102.8905 | 199.926 | 129.058627 | 33.053219 | 62.191976 | 2.380322 | 54.180454 | 547.691101 | 41.315250 | 16.910 | 48.240002 | 1994.989990 | 94.800003 | 1255.190913 | 20.969999 | 49.649830 |
| 2015-02-28 | 102.2335 | 203.021 | 115.615595 | 32.978743 | 62.133020 | 2.279944 | 55.287643 | 575.075623 | 38.957119 | 16.891 | 49.759998 | 2104.500000 | 95.250000 | 1212.599976 | 13.340000 | 50.757572 |
| 2015-03-31 | 101.8914 | 205.757 | 111.378145 | 33.770899 | 62.370155 | 1.879726 | 53.077141 | 556.104431 | 39.342129 | 17.896 | 47.599998 | 2067.889893 | 98.360001 | 1183.099976 | 15.290000 | 50.276520 |
| 2015-04-30 | 101.3355 | 203.577 | 100.978142 | 33.961862 | 62.285572 | 1.860331 | 54.557053 | 644.031799 | 39.062115 | 17.693 | 59.630001 | 2085.510010 | 94.599998 | 1182.400024 | 14.550000 | 50.714561 |
| 2015-05-31 | 100.8883 | 208.973 | 104.462104 | 34.198546 | 62.192864 | 1.600471 | 53.642059 | 604.439758 | 37.634182 | 17.929 | 60.299999 | 2107.389893 | 96.910004 | 1189.400024 | 13.840000 | 50.894852 |
| 2023-08-31 | 100.8393 | 287.310 | 175.505155 | 32.015400 | 69.925825 | 4.352000 | 102.095436 | 323.572021 | 41.689693 | 16.071 | 83.629997 | 4507.660156 | 103.620003 | 1938.199951 | 13.570000 | 64.311607 |
| 2023-09-30 | 101.0211 | 292.623 | 200.638478 | 32.695200 | 70.348026 | 3.870000 | 107.962784 | 327.665009 | 35.972923 | 16.209 | 90.790001 | 4288.049805 | 106.169998 | 1848.099976 | 17.520000 | 63.273911 |
| 2023-10-31 | 100.4689 | 286.765 | 190.443809 | 32.664200 | 70.633616 | 4.039000 | 97.192207 | 308.064331 | 30.467567 | 15.738 | 81.019997 | 4193.799805 | 106.660004 | 1985.199951 | 18.139999 | 62.613869 |
| 2023-11-30 | 100.8639 | 282.146 | 207.972104 | 32.768800 | 71.317001 | 4.029000 | 95.199226 | 296.573669 | 36.005360 | 15.782 | 75.959999 | 4567.799805 | 103.500000 | 2038.099976 | 12.920000 | 65.674210 |
| 2023-12-31 | 100.6031 | 280.763 | 223.023700 | 32.758900 | 71.323795 | 4.228000 | 92.641808 | 297.991425 | 39.222725 | 16.321 | 71.650002 | 4769.830078 | 101.330002 | 2062.399902 | 12.450000 | 67.770882 |
Certain data points and temporal periods were eliminated to ensure the structural integrity and stochastic validity required for Bayesian network learning. First, rows with duplicated timestamps were removed to guarantee a strictly chronological series, which is indispensable for transition matrices and Markov isolation. Stagnant data showing null variance over consecutive windows was converted to missing values, because that behavior reflects capture failures or frozen institutional feeds; if retained, it would create spurious dependencies and bias the model's transition probabilities. Observations that violated basic economic axioms were also removed, namely negative values in strictly positive variables such as OPEC production and the industrial production index; Crude Oil WTI was the only variable exempted from this exclusion, to accommodate the legitimate price collapse observed in April 2020.
Regarding severe univariate outliers that exceeded the robust limit of absolute deviations from the median (MAD), we chose not to remove entire dates, to avoid breaking multivariate temporal continuity. Instead, those outliers were winsorized, i.e. coercively clipped to acceptable upper and lower limits.
This approach prevents market anomalies and extreme noise from distorting the covariance matrix and the maximum likelihood estimates used during the calibration of the causal parameters.
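A minimal sketch of the MAD-based winsorization step described above; the `3.5` threshold, the helper name `winsorize_mad`, and the synthetic series are illustrative assumptions, not the project's exact parameters.

```python
import numpy as np
import pandas as pd
from scipy.stats import median_abs_deviation

def winsorize_mad(series: pd.Series, threshold: float = 3.5) -> pd.Series:
    """Clip values whose robust distance from the median exceeds `threshold` MADs."""
    med = series.median()
    # scale='normal' makes the MAD comparable to a standard deviation under normality
    mad = median_abs_deviation(series, scale='normal')
    if mad == 0:
        return series
    lower, upper = med - threshold * mad, med + threshold * mad
    return series.clip(lower=lower, upper=upper)

# Illustrative use on a synthetic series containing one extreme spike
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(50, 5, 100))
s.iloc[10] = 500  # artificial outlier
s_w = winsorize_mad(s)
```

Clipping (rather than dropping the date) keeps the multivariate row intact, which is exactly the continuity argument made above.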
Step 7 a)¶
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as stats
from sklearn.preprocessing import StandardScaler
plt.style.use('seaborn-v0_8-paper')
sns.set_theme(style="ticks", context="paper", font_scale=1.2)
plt.rcParams.update({
'figure.dpi': 300,
'axes.titleweight': 'bold',
'axes.labelweight': 'bold',
'lines.linewidth': 1.5
})
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_clean_final),
index=df_clean_final.index,
columns=df_clean_final.columns)
def plot_student_A_distributions(df, df_scaled, target_var='Crude_Oil_WTI'):
    fig = plt.figure(figsize=(16, 12))
    ax1 = fig.add_subplot(2, 2, 1)
    target_data = df[target_var].dropna()
    sns.histplot(target_data, stat="density", bins=30, color="steelblue", alpha=0.6, ax=ax1)
    sns.kdeplot(target_data, color="darkblue", linewidth=2, label="Empirical KDE", ax=ax1)
    mu, std = stats.norm.fit(target_data)
    xmin, xmax = ax1.get_xlim()
    x = np.linspace(xmin, xmax, 100)
    p = stats.norm.pdf(x, mu, std)
    ax1.plot(x, p, 'k--', linewidth=2, label=rf'Normal Fit ($\mu$={mu:.2f}, $\sigma$={std:.2f})')
    ax1.set_title(f"Figure 8a: Empirical vs. Normal Distribution - {target_var}")
    ax1.legend()
    ax2 = fig.add_subplot(2, 2, 2)
    n = len(target_data)
    sorted_data = np.sort(target_data)
    z_empirical = (sorted_data - mu) / std
    p_vals = (np.arange(1, n + 1) - 0.5) / n
    q_theoretical = stats.norm.ppf(p_vals)
    deviation = z_empirical - q_theoretical
    se = np.sqrt((p_vals * (1 - p_vals)) / (n * (stats.norm.pdf(q_theoretical)**2)))
    ci_upper = 1.96 * se
    ci_lower = -1.96 * se
    ax2.scatter(q_theoretical, deviation, color='steelblue', s=15, alpha=0.8, label='Empirical Deviation')
    ax2.plot(q_theoretical, ci_upper, color='crimson', linestyle='--', linewidth=1.5, label='95% CI Band')
    ax2.plot(q_theoretical, ci_lower, color='crimson', linestyle='--', linewidth=1.5)
    ax2.axhline(0, color='black', linewidth=1)
    ax2.set_title(f"Figure 8b: Worm Plot (Detrended Q-Q) - {target_var}")
    ax2.set_xlabel("Theoretical Quantiles")
    ax2.set_ylabel("Deviation")
    ax2.legend(loc='best')
    ax3 = fig.add_subplot(2, 1, 2)
    key_vars = ['INDPRO', 'GEPUCURRENT', 'OPEC_Spare_Capacity', 'XOM', 'Volatility_Index_VIX']
    available_vars = [v for v in key_vars if v in df_scaled.columns]
    if available_vars:
        df_melted = df_scaled[available_vars].melt(var_name='Variable', value_name='Standardized Value')
        sns.violinplot(x='Standardized Value', y='Variable', data=df_melted,
                       palette="muted", inner="quartile", ax=ax3, scale="width")
        ax3.set_title("Figure 8c: Cross-Sectional Distribution Density (Standardized)")
        ax3.set_xlabel("Z-Score")
        ax3.set_ylabel("")
    sns.despine()
    plt.tight_layout()
    plt.show()

plot_student_A_distributions(df_clean_final, df_scaled)
The Crude Oil WTI distribution shows mild multimodality and an asymmetrical shape that deviates from the normal distribution, though not critically (Figure 8a); under a less strict criterion, approximate normality can still be assumed. This is corroborated by the Worm Plot (detrended Q-Q plot), in which only a few points fall outside the 95% confidence bands of the theoretical quantiles (Figure 8b).
Finally, the standardized densities of indicators such as the VIX and OPEC spare capacity show right skewness, evidencing the frequent occurrence of extreme volatility shocks (Figure 8c).
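The visual normality assessment can be complemented with formal tests. A sketch using `scipy.stats` on a synthetic right-skewed sample (standing in for the WTI series, which is not reproduced here); for mildly skewed data both tests typically reject strict normality, matching the reading of Figure 8a:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic right-skewed sample standing in for a price series
sample = rng.lognormal(mean=4.0, sigma=0.3, size=500)

# Jarque-Bera tests skewness/kurtosis jointly; Shapiro-Wilk tests the full shape
jb_stat, jb_p = stats.jarque_bera(sample)
sh_stat, sh_p = stats.shapiro(sample)

# A small p-value rejects the normality hypothesis
print(f"Jarque-Bera p={jb_p:.4f}, Shapiro-Wilk p={sh_p:.4f}")
```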
Step 7 b)¶
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
def plot_student_B_timeseries(df, df_scaled, target_var='Crude_Oil_WTI'):
    fig = plt.figure(figsize=(16, 14))
    ax1 = fig.add_subplot(3, 1, 1)
    key_drivers = [col for col in ['US_Ind_Production', 'OPEC_Production', 'SP500_Index', target_var] if col in df_scaled.columns]
    for col in key_drivers:
        linewidth = 2.5 if col == target_var else 1.2
        alpha = 1.0 if col == target_var else 0.6
        ax1.plot(df_scaled.index, df_scaled[col], label=col, linewidth=linewidth, alpha=alpha)
    ax1.axvspan(pd.to_datetime('2020-02-01'), pd.to_datetime('2020-05-01'), color='crimson', alpha=0.15, label='COVID-19 Shock')
    ax1.set_title("Figure 9a: Synchronized Regime Shifts - Standardized Macro-Financial Dynamics")
    ax1.set_ylabel("Z-Score")
    ax1.legend(loc='upper left', ncol=5, frameon=False)
    ax2 = fig.add_subplot(3, 1, 2)
    ax2.plot(df.index, df[target_var], color='black', label=f'{target_var} (Spot)', linewidth=1.5)
    ax2.plot(df.index, df[target_var].rolling(6).mean(), color='darkorange', linestyle='--', label='6-Month SMA')
    ax2.plot(df.index, df[target_var].rolling(12).mean(), color='steelblue', linestyle='-.', label='12-Month SMA')
    ax2.set_title("Figure 9b: Long-Term Trend Extraction and Moving Averages")
    ax2.set_ylabel("USD / bbl")
    ax2.legend(loc='upper left')
    ax3 = fig.add_subplot(3, 1, 3)
    log_returns = np.log(df[target_var] / df[target_var].shift(1))
    rolling_vol = log_returns.rolling(window=3).std() * np.sqrt(12)
    ax3.plot(rolling_vol.index, rolling_vol, color='firebrick', linewidth=1.5)
    ax3.fill_between(rolling_vol.index, rolling_vol, color='firebrick', alpha=0.2)
    ax3.set_title("Figure 9c: Volatility Clustering (Annualized 3-Month Rolling Std Dev)")
    ax3.set_ylabel("Annualized Volatility")
    ax3.set_xlabel("Date")
    sns.despine()
    plt.tight_layout()
    plt.show()

plot_student_B_timeseries(df_clean_final, df_scaled)
The Z-score analysis shows that the COVID-19 shock in 2020 caused an abrupt drop in oil prices and US industrial production, altering the macroeconomic regime (Figure 9a). Tracking the price with 6- and 12-month moving averages shows the cyclical recovery of the asset, which surpassed 100 USD/bbl in 2022 after hitting lows close to 20 USD/bbl at the start of the pandemic (Figure 9b).
The annualized volatility highlights the clustering, with a peak of approximately 2.5 during the pandemic lockdown, followed by residual instability (Figure 9c).
Step 7 c)¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats import median_abs_deviation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
try:
    df_master = pd.concat([df_student_A, df_student_B, df_student_C], axis=1)
    df_sterilized = sterilize_dataset(df_master, target_var='Crude_Oil_WTI')
    print("--- STERILIZATION PIPELINE COMPLETED ---")
except NameError as e:
    print(f"Definition error: {e}. Check that df_student_A, B and C were created.")

def plot_student_C_multivariate(df, target_var='Crude_Oil_WTI'):
    corr_matrix = df.corr(method='spearman')
    clustermap = sns.clustermap(corr_matrix, cmap='coolwarm', center=0,
                                annot=True, fmt=".2f", annot_kws={"size": 7},
                                linewidths=0.5, figsize=(12, 10),
                                cbar_kws={'label': 'Spearman Correlation'})
    plt.setp(clustermap.ax_heatmap.get_xticklabels(), rotation=45, ha='right')
    clustermap.fig.suptitle("Figure 10a: Hierarchical Clustered Correlation Matrix (Spearman Rank)", y=1.02, fontweight='bold')
    plt.show()
    usd_col = [col for col in df.columns if 'USD' in col]
    if usd_col:
        usd_col = usd_col[0]
        fig, ax = plt.subplots(figsize=(14, 4))
        rolling_rank_target = df[target_var].rolling(12).rank()
        rolling_rank_usd = df[usd_col].rolling(12).rank()
        rolling_corr = rolling_rank_target.rolling(12).corr(rolling_rank_usd)
        ax.plot(rolling_corr.index, rolling_corr, color='purple', linewidth=2)
        ax.axhline(0, color='black', linestyle='--', linewidth=1)
        ax.fill_between(rolling_corr.index, 0, rolling_corr, where=(rolling_corr > 0), color='green', alpha=0.3)
        ax.fill_between(rolling_corr.index, 0, rolling_corr, where=(rolling_corr < 0), color='red', alpha=0.3)
        ax.set_title("Figure 10b: Dynamic Regime Shift - 12-Month Rolling Spearman Correlation")
        ax.set_ylabel("Spearman $\\rho$")
        sns.despine()
        plt.tight_layout()
        plt.show()
    top_corr_var = corr_matrix[target_var].drop(target_var).abs().idxmax()
    joint = sns.jointplot(x=top_corr_var, y=target_var, data=df,
                          kind="kde", fill=True, cmap="mako", height=7, space=0)
    joint.plot_marginals(sns.histplot, color="darkblue", bins=25)
    joint.fig.suptitle(f"Figure 10c: Bivariate Joint Probability Density ({target_var} vs {top_corr_var})", y=1.02, fontweight='bold')
    plt.show()

plot_student_C_multivariate(df_sterilized)
--- STERILIZATION PIPELINE COMPLETED ---
The correlation matrix identifies a strong link (0.94) between oil and the Energy CPI, while OPEC production has an inversely proportional relationship (-0.81) with idle capacity (Figure 10a).
The regime dynamics show that the correlation is not static, exhibiting oscillations that alternate between mutual support and divergence, especially visible in 2021 (Figure 10b).
The joint distribution shows that the WTI-Energy CPI relationship is not linear but composed of multiple density cores, suggesting the existence of different equilibrium states or regime changes in the market (Figure 10c).
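The "multiple density cores" reading can be probed with a finite mixture model. A hedged sketch using scikit-learn's `GaussianMixture` on synthetic bivariate data (two hypothetical low/high price regimes standing in for the actual WTI vs. Energy CPI observations), comparing one- vs. two-regime fits by BIC:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
# Two synthetic bivariate clusters standing in for low/high price regimes
low = rng.multivariate_normal([50, 200], [[25, 10], [10, 100]], size=60)
high = rng.multivariate_normal([85, 280], [[25, 10], [10, 100]], size=60)
X = np.vstack([low, high])

# Compare a 1-component (single regime) fit against a 2-component fit via BIC
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X) for k in (1, 2)}
best_k = min(bic, key=bic.get)
```

A lower BIC for the two-component model is consistent with the multi-regime interpretation of Figure 10c.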
Step 8a)¶
What differentiates the oil price from other financial assets is its extreme sensitivity to geopolitical and structural shocks, which generate breaks in the time series, as observed during the COVID-19 shock in 2020 (Figure 9a). Unlike assets with smoother growth, oil shows clustered volatility, where periods of stability are interrupted by prolonged explosions of variance, as shown by the peaks in the rolling standard deviation (Figure 9c). Furthermore, the asset exhibits dynamic regime changes, in which its correlation with other indicators, such as industrial production or the S&P 500 index, is not constant and undergoes abrupt reversals over time (Figure 10b).
Step 8b)¶
The oil distribution deviates from the normal distribution (though not severely), presenting an asymmetrical shape (Figure 8a). The Worm Plot confirms that the deviations are small but present: most values lie inside the 95% confidence bands, yet some fall outside, indicating the presence of extreme events and heavy tails (Figure 8b).
Additionally, exogenous variables that influence the price, such as OPEC spare capacity and the VIX index, show heavy right tails in their distributions (Figure 8c).
Step 8c)¶
The data suggest persistent, long-term temporal dependence, where the current price is influenced by past trends, as demonstrated by the 6- and 12-month moving averages (Figure 9b). Although short-term returns appear random, their magnitudes have positive autocorrelation, resulting in clustered volatility (Figure 9c). This structured dependence is also observed in the way oil moves in sync with the Energy CPI, maintaining the co-movement during almost the entire sample period (Figure 10a).
Step 8d)¶
Other stylized facts include the extremely high correlation (0.94) between Crude Oil WTI and the US Energy CPI, revealing the direct dependence of energy inflation on the barrel price (Figure 10a). The analysis of the joint bivariate density shows that the relationship is not linear but composed of distinct probability cores, suggesting that the market operates in multiple equilibrium states (Figure 10c). Finally, the relationship between oil and safe-haven assets (such as Gold) or risk assets (such as the S&P 500) is unstable, alternating between regimes of positive and negative correlation (Figure 10b).
Step 9a)¶
Probabilistic graphical models (PGMs) represent the convergence of graph theory and probability theory, helping us model multidimensional systems under uncertainty. The advantage of these models is their capacity to encode conditional independence between hundreds of macroeconomic and microeconomic variables, mitigating the computational cost associated with calculating multivariate joint distributions.
This approach includes concepts such as the belief network and the Markov network. A belief network, frequently called a Bayesian network, uses a Directed Acyclic Graph (DAG) to capture asymmetric, directional dependences, revealing causality. In a DAG, the joint probability of the modelled system factors into the product of the conditional probability of each node given its set of parents: $$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i)).$$
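As a worked instance of this factorization, consider a hypothetical three-node binary chain S → W → C (say, supply regime → WTI price → energy CPI; the node names and probability values are purely illustrative):

```python
# Joint probability from the DAG factorization P(S, W, C) = P(S) P(W|S) P(C|W)
# for a hypothetical binary chain S -> W -> C (states: 0 = low, 1 = high).
P_S = {0: 0.6, 1: 0.4}
P_W_given_S = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}    # P(W | S)
P_C_given_W = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.25, 1: 0.75}}  # P(C | W)

def joint(s, w, c):
    """Factored joint: product of each node's CPT entry given its parent."""
    return P_S[s] * P_W_given_S[s][w] * P_C_given_W[w][c]

# The factored joint is a valid distribution: all 8 configurations sum to 1,
# yet only 1 + 2 + 2 = 5 free parameters were specified instead of 2^3 - 1 = 7.
total = sum(joint(s, w, c) for s in (0, 1) for w in (0, 1) for c in (0, 1))
```

The parameter saving is modest here, but it grows exponentially with the number of variables, which is the computational-cost argument made above.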
This factorization is what allows us to map exogenous shocks, such as fluctuations in OPEC idle capacity, whose effects propagate asymmetrically into commodity price formation.
On the other hand, a Markov network operates on an undirected graph, being structurally designed to model spatial correlations or symmetrical relationships where causal directionality is ambiguous or driven by cyclical feedback.
The joint distribution in a Markov network does not derive from parental relations but from the product of positive potential functions defined on the maximal cliques of the graph, normalized by a global partition constant: $P(X) = \frac{1}{Z} \prod_{c \in C} \phi_c(X_c)$. While the directed approach imposes sequential causality on price formation, the undirected approach accommodates purely reciprocal, general-equilibrium interactions.
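A minimal numeric instance of the undirected factorization, for a single clique over two binary variables with a symmetric agreement potential (the potential values are illustrative):

```python
import itertools

# Undirected pairwise model over two binary variables with one clique {A, B}.
# phi rewards agreement between A and B; it is symmetric, so no causal
# direction is implied, in contrast to the DAG factorization.
phi = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}

# The partition constant Z normalizes the raw potentials into probabilities
Z = sum(phi[a, b] for a, b in itertools.product((0, 1), repeat=2))
P = {state: val / Z for state, val in phi.items()}
```

Note that the potentials themselves are not probabilities; only after dividing by Z does the table sum to one.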
Step 9 b)¶
from pgmpy.estimators import HillClimbSearch, MaximumLikelihoodEstimator, BIC
from pgmpy.models import DiscreteBayesianNetwork
from sklearn.preprocessing import KBinsDiscretizer
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
df_discrete = pd.DataFrame(
discretizer.fit_transform(df_clean_final),
columns=df_clean_final.columns,
index=df_clean_final.index
).astype(int)
hc_search = HillClimbSearch(df_discrete)
scoring_method = BIC(df_discrete)
best_model_structure = hc_search.estimate(scoring_method=scoring_method, show_progress=False)
causal_dag = DiscreteBayesianNetwork(best_model_structure.edges())
causal_dag.fit(df_discrete, estimator=MaximumLikelihoodEstimator)
fig, ax = plt.subplots(figsize=(14, 10))
G = nx.DiGraph(causal_dag.edges())
pos = nx.spring_layout(G, k=1.2, seed=42)
nx.draw_networkx_nodes(G, pos, node_size=3000, node_color='#4682B4', alpha=0.8, ax=ax)
nx.draw_networkx_edges(G, pos, edge_color='gray', arrows=True, arrowsize=20, width=1.5, ax=ax)
nx.draw_networkx_labels(G, pos, font_size=9, font_weight='bold', ax=ax)
ax.set_title("Figure 11: Discovered Causal Structure (DAG) for Global Oil Market", fontsize=15)
plt.axis('off')
plt.show()
The DAG illustrates the discovered hierarchy of causal influence, where directional arrows establish precedence between macroeconomic and market indicators (Figure 11). Here we see that Exxon Mobil Corp acts as a central transmission node affecting the dollar index and market volatility (VIX), while Crude Oil WTI emerges as an essential mediator that receives influence from assets such as Gold and propagates shocks to the Energy CPI.
Step 9 c)¶
Markov theory is the foundation of modelling dynamic stochastic processes, introducing the memoryless property: the future evolution of a complex system is conditionally dependent only on its current state, $$P(X_{t+1} \mid X_t, X_{t-1}, \dots, X_1) = P(X_{t+1} \mid X_t).$$ This allows us to model the transition matrix of regime changes, capturing transition probabilities without an explosion of computational cost.
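The memoryless property is what makes a regime transition matrix estimable by simply counting consecutive state pairs. A sketch assuming three discretized regimes (0 = low, 1 = mid, 2 = high), analogous to the 3-bin `KBinsDiscretizer` output used elsewhere in this notebook; the state path is illustrative:

```python
import numpy as np

def estimate_transition_matrix(states: np.ndarray, n_states: int) -> np.ndarray:
    """Count first-order transitions and row-normalize into P(X_{t+1} | X_t)."""
    counts = np.zeros((n_states, n_states))
    for s_now, s_next in zip(states[:-1], states[1:]):
        counts[s_now, s_next] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observed transitions stay at zero instead of dividing by zero
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Illustrative regime path with persistent states (a volatility-clustering analogue)
path = np.array([0, 0, 0, 1, 1, 2, 2, 2, 1, 0, 0, 1, 2, 2])
T = estimate_transition_matrix(path, n_states=3)
```

Each row of `T` is a conditional distribution over next-period regimes, which is exactly the object the Markov property licenses us to estimate from pairs alone.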
The Markov blanket, in turn, represents the conditional isolation kernel of an individual node inside a Bayesian network. It is the subset formed by the node's parents, its children, and its spouses (the other parents shared by the same children). Conditional on these nodes alone, the target variable becomes stochastically independent of every other node in the graph.
from pgmpy.estimators import HillClimbSearch, MaximumLikelihoodEstimator
from pgmpy.models import DiscreteBayesianNetwork
from sklearn.preprocessing import KBinsDiscretizer
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
df_discrete = pd.DataFrame(
discretizer.fit_transform(df_clean_final),
columns=df_clean_final.columns,
index=df_clean_final.index
).astype(int)
hc_search = HillClimbSearch(df_discrete)
best_model_structure = hc_search.estimate(scoring_method='bic-d', show_progress=False)
causal_dag = DiscreteBayesianNetwork(best_model_structure.edges())
causal_dag.fit(df_discrete, estimator=MaximumLikelihoodEstimator)
target_variable = 'Crude_Oil_WTI'
if target_variable in causal_dag.nodes():
    markov_blanket_nodes = causal_dag.get_markov_blanket(target_variable)
    parents = causal_dag.get_parents(target_variable)
    children = causal_dag.get_children(target_variable)
    spouses = set()
    for child in children:
        spouses.update(causal_dag.get_parents(child))
    spouses.discard(target_variable)
    spouses = list(spouses)
    plt.style.use('default')
    fig, ax = plt.subplots(figsize=(16, 12), facecolor='white')
    G = nx.DiGraph(causal_dag.edges())
    pos = nx.spring_layout(G, k=1.5, seed=42)
    node_colors = []
    for node in G.nodes():
        if node == target_variable:
            node_colors.append('#8B0000')
        elif node in parents:
            node_colors.append('#228B22')
        elif node in children:
            node_colors.append('#4682B4')
        elif node in spouses:
            node_colors.append('#DAA520')
        else:
            node_colors.append('#E0E0E0')
    nx.draw_networkx_nodes(G, pos, node_size=3500, node_color=node_colors, edgecolors='black', linewidths=1.5, alpha=0.9, ax=ax)
    nx.draw_networkx_edges(G, pos, edge_color='lightgray', arrows=True, arrowstyle='-|>', arrowsize=15, width=1.0, alpha=0.5, ax=ax)
    mb_family = [target_variable] + parents + children + spouses
    mb_edges = [(u, v) for u, v in G.edges() if u in mb_family and v in mb_family]
    nx.draw_networkx_edges(G, pos, edgelist=mb_edges, edge_color='black', arrows=True, arrowstyle='-|>', arrowsize=25, width=2.5, ax=ax)
    nx.draw_networkx_labels(G, pos, font_size=9, font_weight='bold', font_family='sans-serif', ax=ax)
    legend_elements = [
        mpatches.Patch(facecolor='#8B0000', edgecolor='black', label='Target Variable (WTI)'),
        mpatches.Patch(facecolor='#228B22', edgecolor='black', label='Parents (Direct Causes)'),
        mpatches.Patch(facecolor='#4682B4', edgecolor='black', label='Children (Direct Effects)'),
        mpatches.Patch(facecolor='#DAA520', edgecolor='black', label='Spouses (Co-parents of Children)'),
        mpatches.Patch(facecolor='#E0E0E0', edgecolor='black', label='Conditionally Independent Nodes')
    ]
    ax.legend(handles=legend_elements, loc='upper right', title="Markov Blanket Topology",
              fontsize=11, title_fontsize=13, frameon=True, borderpad=1.5)
    ax.set_title(f"Figure 12: Markov Blanket Isolation for {target_variable}", fontsize=18, fontweight='bold', pad=20)
    plt.axis('off')
    plt.tight_layout()
    plt.show()
From a macroeconomic and geopolitical point of view, the identification of the US Energy CPI as a direct cause reveals that the barrel price is tightly linked to domestic inflationary cycles (Figure 12). For investment decisions, this suggests that monitoring this indicator can help anticipate pressures on the cost of capital and adjust exposure to inflation-protected assets.
From a microeconomic point of view, the inclusion of the oil services ETF in the blanket confirms that operational dynamics and extraction costs are determinants of supply (Figure 12). The recommendation is to watch the performance of service providers as a leading signal of price formation.
From a financial point of view, the causal link established with the S&P 500 index shows that oil operates as a gauge of global risk and market liquidity (Figure 12).
from IPython.display import display, Latex
algorithm_latex = r"""
\begin{array}{l}
\hline
\mathbf{Algorithm\ 1:} \text{Structural learning via causal inference} \\
\hline
\mathbf{Input:} \\
\quad \mathcal{D}: \text{Observations Dataset} \\
\quad X = \{X_1, \dots, X_m\}: \text{Set of endogenous and exogenous variables} \\
\quad \alpha: \text{Significance level} \\
\mathbf{Output:} \text{Partially complete Directed Acyclic Graph (CPDAG) } G = \langle V, E \rangle \\
\hline
\mathbf{Phase\ 1:\ Learning\ the\ Markov\ Blanket} \\
\mathbf{for\ each} \ X_i \in X \ \mathbf{do} \\
\quad \text{Learn the Markov blanket } \mathcal{B}(X_i) \text{ iteratively via conditional independence tests} \\
\quad \text{(where } \mathcal{B}(X_i) \text{ is the minimal set of parents, children and spouses such that } X_i \perp\!\!\!\perp X \setminus (\{X_i\} \cup \mathcal{B}(X_i)) \mid \mathcal{B}(X_i) \text{)} \\
\mathbf{end\ for} \\
\text{Validate neighborhood symmetry: } \mathbf{if} \ X_i \in \mathcal{B}(X_j) \text{ but } X_j \notin \mathcal{B}(X_i), \text{ reconcile} \\
\\
\mathbf{Phase\ 2:\ Neighborhood \ learning} \\
\text{Initialize an undirected graph with edges } X_i - X_j \text{ for all pairs validated in Phase 1} \\
\mathbf{for\ each} \ \text{adjacent pair } (X_i, X_j) \ \mathbf{do} \\
\quad \text{Identify a separating set } S_{X_iX_j} \subseteq \{ \mathcal{B}(X_i) \cup \mathcal{B}(X_j) \} \setminus \{X_i, X_j\} \\
\quad \mathbf{if} \ X_i \perp\!\!\!\perp X_j \mid S_{X_iX_j} \ \text{with p-value } > \alpha \ \mathbf{then} \\
\quad \quad \text{Remove the edge between } X_i \text{ and } X_j \text{ (the direct causal dependency is rejected)} \\
\quad \quad \text{Store } S_{X_iX_j} \text{ (required for orienting colliders in Phase 3)} \\
\quad \mathbf{else} \\
\quad \quad \text{Keep undirected edge } X_i - X_j \\
\quad \mathbf{end\ if} \\
\mathbf{end\ for} \\
\\
\mathbf{Phase\ 3:\ Edge\ orientation\ (causal\ attribution)} \\
\text{1. Identify collision structures (v-structures):} \\
\quad \mathbf{for\ each} \ \text{triple } X_i - X_k - X_j \text{ (where } X_i \text{ and } X_j \text{ are not adjacent)} \ \mathbf{do} \\
\quad \quad \mathbf{if} \ X_k \notin S_{X_iX_j} \ \mathbf{then} \text{ orient converging to the collider: } X_i \rightarrow X_k \leftarrow X_j \\
\quad \mathbf{end\ for} \\
\text{2. Propagation of consistency rules (Meek's rules):} \\
\quad \mathbf{repeat\ until} \ \text{structural stability (no new edges can be oriented)} \mathbf{do} \\
\quad \quad \text{R1 (Prevent new colliders): If } X_i \rightarrow X_j - X_k \text{ and } X_i, X_k \text{ are not adjacent } \Rightarrow X_j \rightarrow X_k \\
\quad \quad \text{R2 (Prevent cycles): If } X_i \rightarrow X_k \rightarrow X_j \text{ and } X_i - X_j \Rightarrow X_i \rightarrow X_j \\
\quad \quad \text{R3 (Double V-structure): If } X_i - X_k \rightarrow X_j \text{, } X_i - X_l \rightarrow X_j \text{ and } X_i - X_j \Rightarrow X_i \rightarrow X_j \\
\quad \quad \text{R4 (Propagation): If } X_i - X_k \rightarrow X_l \rightarrow X_j \text{ and } X_i - X_l \text{, } X_i - X_j \Rightarrow X_i \rightarrow X_j \\
\quad \mathbf{end\ repeat} \\
\hline
\end{array}
"""
display(Latex(algorithm_latex))
Conclusions¶
The systematic treatment of the data showed that this approach outperforms traditional econometric approaches, and the cleaning procedures allowed the analysis to proceed without losing information or eliminating potentially important market behavior. The identification of the US Energy CPI as a direct cause reveals that the barrel price is tightly linked to domestic inflationary cycles, suggesting that monitoring this indicator is important for anticipating the market. The inclusion of the oil services ETF in the blanket confirmed that operational dynamics and extraction costs are determinants of supply, suggesting that the performance of service providers should be watched as a leading signal of price formation, while the causal link with the S&P 500 index means that oil operates as a gauge of global market risk.
Reference¶
Alvi, D. A. (2018). Application of probabilistic graphical models in forecasting crude oil price. arXiv preprint arXiv:1804.10869.