Introduction¶

In the realm of financial data analysis, ensuring the quality and reliability of data is paramount. The significance of good quality data becomes evident when we examine its implications in financial modeling and decision-making processes. This assignment is structured to provide a comprehensive understanding of data quality, yield curve modeling, the exploitation of correlation, and empirical analysis of Exchange-Traded Funds (ETFs), all within the broader context of financial data analysis.

Firstly (Question 1), we will delve into the concept of data quality by providing examples of poor-quality structured and unstructured data, along with methods to identify their shortcomings. This foundational knowledge is crucial for recognizing the importance of maintaining high standards in financial data.

Next (Question 2), we will explore yield curve modeling by fitting Nelson-Siegel and Cubic-Spline models to government securities from a European Union (Germany), ranging from short-term to long-term maturities. The section 2 will highlight the comparative analysis of these models in terms of fit and interpretation, addressing the ethical considerations of data smoothing.

Furthermore, in Question 3 we will investigate the role of correlation and principal components in financial data analysis by generating uncorrelated Gaussian random variables and running Principal Component Analysis (PCA). The analysis will be extended to real data, collecting daily closing yields for government securities and comparing the results through screeplots.

Lastly, in Question 4, the empirical analysis of ETFs will involve computing daily returns, covariance matrices, PCA, and Singular Value Decomposition (SVD) for a sector ETF, providing a thorough explanation of each transformation and its implications.

Installing Packages¶

In [ ]:
! pip install -q kaggle --quiet
! pip install pandas  --quiet
! pip install numpy  --quiet
! pip install matplotlib --quiet
! pip install seaborn --quiet
! pip install pingouin --quiet
! pip install factor_analyzer --quiet
! pip install gdown --quiet
! pip install datetime --quiet
! pip install requests --quiet
! pip install yfinance --quiet
! pip install pandas-datareader --quiet
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 204.4/204.4 kB 5.5 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 kB 2.0 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for factor_analyzer (pyproject.toml) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.4/44.4 kB 2.1 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.6/52.6 kB 4.1 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 259.8/259.8 kB 10.2 MB/s eta 0:00:00

Importing packages¶

In [ ]:
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
from factor_analyzer.factor_analyzer import calculate_kmo
import pingouin as pg
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from numpy import linalg as LA
from scipy.interpolate import CubicSpline
from scipy.optimize import minimize
from datetime import datetime, timedelta
import time
from operator import index
import yfinance as yfin
from PIL.ImageChops import screen
from fontTools.misc.plistlib import end_key
from matplotlib.projections import projection_registry
from numpy.array_api import full_like
from numpy.core.shape_base import block
from numpy.ma.core import around, outer
from rich.jupyter import display
from scipy import stats
from numpy import linalg as la
from scipy.odr import exponential
from unicodedata import decimal
from IPython.display import set_matplotlib_formats
import math
import re
import gdown
import warnings
import requests
warnings.filterwarnings('ignore')
<ipython-input-6-1dea6b209a65>:19: UserWarning: The numpy.array_api submodule is still experimental. See NEP 47.
  from numpy.array_api import full_like

Question 1: Data Quality¶

Data quality is defined by its ability to meet the expectations of data consumers and accurately represent the objects, events, and concepts it is intended to depict (Sebastian xxx). Data can be classified into structured, unstructured, and semi-structured data (Praveen and Umesh 67). Beyond this classification, data possess dimensions that contribute to its quality, including accuracy, objectivity, believability, reputation, accessibility, access security, relevancy, value-added, timeliness, completeness, amount of data, interpretability, ease of understanding, concise representation, and currency (Huh et al. 560; Strong et al. 104). The absence of at least one of these dimensions can result in the data classification being preceded (prefix) by the term poor quality. Possible causes of poor data quality include missing values, data errors, discrepancies, biases, inconsistencies, static header data, changes in historical data, lack of transparency, reporting time issues, and misuse of data (Liu 34). Other causes can include multiple names for the same entity, missing values, incorrect values, and duplicate records for the same customer (Wang and Diane 7; Nadinić and Kalpić 328). Thus, to answer the questions of this exercise, we will base ourselves on these aspects.

Thus, to answer the questions of this exercise, we will base ourselves on these aspects.

a. Provide an example of poor quality structured data¶

Answer

In [ ]:
file_id = "1Rp1Ubva5Q7ReW_zs7EowH3yge8rq-i5w"
url = f"https://drive.google.com/uc?id={file_id}"
output_filename = "financials.zip"
gdown.download(url, output_filename, quiet=False)

import os
if os.path.exists(output_filename):
    print(f"File '{output_filename}' downloaded successfully!")
else:
    print("File download failed.")
try:
    import pandas as pd
    datafin = pd.read_csv(output_filename)
#    print(datafin.head())
except Exception as e:
    print(f"Unable to read the file: {e}")

datafin.head()
Downloading...
From: https://drive.google.com/uc?id=1Rp1Ubva5Q7ReW_zs7EowH3yge8rq-i5w
To: /content/financials.zip
100%|██████████| 22.0k/22.0k [00:00<00:00, 33.8MB/s]
File 'financials.zip' downloaded successfully!

Out[ ]:
Segment Country Product Discount Band Units Sold Manufacturing Price Sale Price Gross Sales Discounts Sales COGS Profit Date Month Number Month Name Year
0 Government Canada Carretera None $1,618.50 $3.00 $20.00 $32,370.00 $- $32,370.00 $16,185.00 $16,185.00 01/01/2014 1 January 2014
1 Government Germany Carretera None $1,321.00 $3.00 $20.00 $26,420.00 $- $26,420.00 $13,210.00 $13,210.00 01/01/2014 1 January 2014
2 Midmarket France Carretera None $2,178.00 $3.00 $15.00 $32,670.00 $- $32,670.00 $21,780.00 $10,890.00 01/06/2014 6 June 2014
3 Midmarket Germany Carretera None $888.00 $3.00 $15.00 $13,320.00 $- $13,320.00 $8,880.00 $4,440.00 01/06/2014 6 June 2014
4 Midmarket Mexico Carretera None $2,470.00 $3.00 $15.00 $37,050.00 $- $37,050.00 $24,700.00 $12,350.00 01/06/2014 6 June 2014
In [ ]:
datafin.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Segment                700 non-null    object
 1   Country                700 non-null    object
 2    Product               700 non-null    object
 3    Discount Band         700 non-null    object
 4    Units Sold            700 non-null    object
 5    Manufacturing Price   700 non-null    object
 6    Sale Price            700 non-null    object
 7    Gross Sales           700 non-null    object
 8    Discounts             700 non-null    object
 9     Sales                700 non-null    object
 10   COGS                  700 non-null    object
 11   Profit                700 non-null    object
 12  Date                   700 non-null    object
 13  Month Number           700 non-null    int64 
 14   Month Name            700 non-null    object
 15  Year                   700 non-null    int64 
dtypes: int64(2), object(14)
memory usage: 87.6+ KB

Data description: This dataset, sourced from Atharva Arya, provides detailed sales and profit information across various market segments and countries, including key metrics like units sold, sale price, gross sales, discounts, and profit. It enables in-depth analysis of sales performance, profitability, and time-based trends to support business insights and decision-making.

In [ ]:
datafin.head()
Out[ ]:
Segment Country Product Discount Band Units Sold Manufacturing Price Sale Price Gross Sales Discounts Sales COGS Profit Date Month Number Month Name Year
0 Government Canada Carretera None $1,618.50 $3.00 $20.00 $32,370.00 $- $32,370.00 $16,185.00 $16,185.00 01/01/2014 1 January 2014
1 Government Germany Carretera None $1,321.00 $3.00 $20.00 $26,420.00 $- $26,420.00 $13,210.00 $13,210.00 01/01/2014 1 January 2014
2 Midmarket France Carretera None $2,178.00 $3.00 $15.00 $32,670.00 $- $32,670.00 $21,780.00 $10,890.00 01/06/2014 6 June 2014
3 Midmarket Germany Carretera None $888.00 $3.00 $15.00 $13,320.00 $- $13,320.00 $8,880.00 $4,440.00 01/06/2014 6 June 2014
4 Midmarket Mexico Carretera None $2,470.00 $3.00 $15.00 $37,050.00 $- $37,050.00 $24,700.00 $12,350.00 01/06/2014 6 June 2014
In [ ]:
# Checking if there are any NA values
datafin.isna().any().any()
Out[ ]:
False
In [ ]:
# Checking if there are duplicated data
datafin.duplicated().sum()
Out[ ]:
0
In [ ]:
datafin[' Discount Band '].unique()
Out[ ]:
array([' None ', ' Low ', ' Medium ', ' High '], dtype=object)
In [ ]:
datafin.columns
Out[ ]:
Index(['Segment', 'Country', ' Product ', ' Discount Band ', ' Units Sold ',
       ' Manufacturing Price ', ' Sale Price ', ' Gross Sales ', ' Discounts ',
       '  Sales ', ' COGS ', ' Profit ', 'Date', 'Month Number',
       ' Month Name ', 'Year'],
      dtype='object')

b. How would you recognize this poor quality?¶

Write 3 - 4 sentences that show how the data fails to include properties of good quality data.

Answer

Although the data does not have missing or duplicated values, it includes currency symbols ($), missing or incorrect values, like the placeholder ($-), commas, and extra spaces before and after the names of some variables as well as their observations. Additionally, quantitative variables are classified as objects rather than as integers (int) or floats, which makes it difficult to perform numerical operations or analysis.

c. Provide an example of poor quality unstructured data

In [ ]:
file_id = "1NIakNIwqoDgw9CvFWrU-s-uuJHdu6KQu"
url = f"https://drive.google.com/uc?id={file_id}"
output_filename = "Airplane_Crashes_and_Fatalities_Since_1908.csv.zip"
gdown.download(url, output_filename, quiet=False)

import os
if os.path.exists(output_filename):
    print(f"File '{output_filename}' downloaded successfully!")
else:
    print("File download failed.")

try:
    import pandas as pd
    data = pd.read_csv(output_filename)
  #  print(data.head())
except Exception as e:
    print(f"Unable to read the file: {e}")
data.head()
Downloading...
From: https://drive.google.com/uc?id=1NIakNIwqoDgw9CvFWrU-s-uuJHdu6KQu
To: /content/Airplane_Crashes_and_Fatalities_Since_1908.csv.zip
100%|██████████| 596k/596k [00:00<00:00, 103MB/s]
File 'Airplane_Crashes_and_Fatalities_Since_1908.csv.zip' downloaded successfully!

Out[ ]:
index Date Time Location Operator Flight # Route Type Registration cn/In Aboard Fatalities Ground Summary
0 0 09/17/1908 17:18 Fort Myer, Virginia Military - U.S. Army NaN Demonstration Wright Flyer III NaN 1 2.0 1.0 0.0 During a demonstration flight, a U.S. Army fly...
1 1 07/12/1912 06:30 AtlantiCity, New Jersey Military - U.S. Navy NaN Test flight Dirigible NaN NaN 5.0 5.0 0.0 First U.S. dirigible Akron exploded just offsh...
2 2 08/06/1913 NaN Victoria, British Columbia, Canada Private - NaN Curtiss seaplane NaN NaN 1.0 1.0 0.0 The first fatal airplane accident in Canada oc...
3 3 09/09/1913 18:30 Over the North Sea Military - German Navy NaN NaN Zeppelin L-1 (airship) NaN NaN 20.0 14.0 0.0 The airship flew into a thunderstorm and encou...
4 4 10/17/1913 10:30 Near Johannisthal, Germany Military - German Navy NaN NaN Zeppelin L-2 (airship) NaN NaN 30.0 30.0 0.0 Hydrogen gas which was being vented was sucked...

Data description: This dataset, sourced from Data Society, compiles information on Boeing 707 accidents dating back to 1948. It includes various details such as the incident's date and time, location, and the operator of the aircraft. Additional data includes the flight number, the route taken, and the aircraft's type and registration number. Also recorded are the construction (serial) number, the total number of people aboard, fatalities, fatalities on the ground, and a summary of each incident.

In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   index         5268 non-null   int64  
 1   Date          5268 non-null   object 
 2   Time          3049 non-null   object 
 3   Location      5248 non-null   object 
 4   Operator      5250 non-null   object 
 5   Flight #      1069 non-null   object 
 6   Route         3561 non-null   object 
 7   Type          5241 non-null   object 
 8   Registration  4933 non-null   object 
 9   cn/In         4040 non-null   object 
 10  Aboard        5246 non-null   float64
 11  Fatalities    5256 non-null   float64
 12  Ground        5246 non-null   float64
 13  Summary       4878 non-null   object 
dtypes: float64(3), int64(1), object(10)
memory usage: 576.3+ KB
In [ ]:
data.columns
Out[ ]:
Index(['index', 'Date', 'Time', 'Location', 'Operator', 'Flight #', 'Route',
       'Type', 'Registration', 'cn/In', 'Aboard', 'Fatalities', 'Ground',
       'Summary'],
      dtype='object')
In [ ]:
data.isna().sum()
Out[ ]:
0
index 0
Date 0
Time 2219
Location 20
Operator 18
Flight # 4199
Route 1707
Type 27
Registration 335
cn/In 1228
Aboard 22
Fatalities 12
Ground 22
Summary 390

In [ ]:
data.duplicated().sum()
Out[ ]:
0
In [ ]:
groupdata = data.groupby(['Date', 'Time', 'Location', 'Flight #'])['Fatalities'].sum()
groupdata
Out[ ]:
Fatalities
Date Time Location Flight #
01/01/1976 05:30 Near Al Qaysumah, Saudi Arabia 438 82.0
01/01/1978 20:15 Off Bandra, Maharashtra, India 855 213.0
01/01/1985 20:30 Near La Paz, Bolivia 980 29.0
01/01/2002 18:02 Hollywood, Florida - 5.0
01/01/2007 14:07 Off Parepare, Indonesia 574 102.0
... ... ... ... ...
12/29/1991 15:05 Wanli, Taiwan 358 5.0
12/29/1994 15:30 Near Van, Turkey 278 57.0
12/30/1945 21:13 New York City, New York 14 1.0
12/30/1951 22:07 Near Fairbanks, Alaska 501 4.0
12/31/1968 08:30 Port Hedland, WA, Australia 1750 26.0

996 rows × 1 columns


In [ ]:
filtered_data=data[(data['Time'].isna() & data['Flight #'].isna()) & (data['Fatalities'] >= 10)]
filtered_data.head()
Out[ ]:
index Date Time Location Operator Flight # Route Type Registration cn/In Aboard Fatalities Ground Summary
7 7 07/28/1916 NaN Near Jambol, Bulgeria Military - German Army NaN NaN Schutte-Lanz S-L-10 (airship) NaN NaN 20.0 20.0 0.0 Crashed near the Black Sea, cause unknown.
10 10 11/21/1916 NaN Mainz, Germany Military - German Army NaN NaN Super Zeppelin (airship) NaN NaN 28.0 27.0 0.0 Crashed in a storm.
12 12 03/04/1917 NaN Near Gent, Belgium Military - German Army NaN NaN Airship NaN NaN 20.0 20.0 0.0 Caught fire and crashed.
13 13 03/30/1917 NaN Off Northern Germany Military - German Navy NaN NaN Schutte-Lanz S-L-9 (airship) NaN NaN 23.0 23.0 0.0 Struck by lightning and crashed into the Balti...
19 19 05/10/1918 NaN Off Helgoland Island, Germany Military - German Navy NaN NaN Zeppelin L-70 (airship) NaN NaN 22.0 22.0 0.0 Shot down by British aircraft crashing from a ...
In [ ]:
filtered_data.shape
Out[ ]:
(986, 14)

d. Unstructured data can be more difficult to assess than structured data. Just you did in part b, write 3 - 4 sentences that show how this unstructured data fails to check requirements of good quality data.

Answer

Although there are no duplicate data, there is an excess of missing values (NaN), even in situations where fatalities are recorded. In other words, there are missing values in columns like "Flight #" and "cn/In," which makes it difficult to fully assess the incident details and compare across records. For instance, a filter was applied to all rows with missing values in the "Date" and "Flight #" variables but with fatalities above 10, and it was found that there are 986 rows, which reinforces the fact of poor quality. Additionally, variables like "Date" are not properly formatted (datatype), and there is an "Index" column that does not contribute to the analysis, further indicating that the data is not structured properly (unstructured).

Question 2: Yield Curve Modeling¶

a. Pick government securities from a country. The country selected should be one of the countries from your group so that you can fit a Nelson-Siegel mode

Answer

Fitting a Nelson-Siegel model on EU government securities

In [ ]:
base_url = 'https://sdw-wsrest.ecb.europa.eu/service/'
params = {
   'startPeriod': '2020-01-01',
   'endPeriod': '2024-12-31',
   'format': 'jsondata'
}

maturities = ['SR_3M', 'SR_1Y', 'SR_2Y', 'SR_5Y', 'SR_10Y', 'SR_30Y']
data_dict = {}

for maturity in maturities:
   url = f"{base_url}data/YC/B.U2.EUR.4F.G_N_A.SV_C_YM.{maturity}"
   response = requests.get(url, params=params)
   data = response.json()

   # Extract values (first element of each observation array)
   values = [obs[0] for obs in data['dataSets'][0]['series']['0:0:0:0:0:0:0']['observations'].values()]

   # Get dates from structure
   dates = [date['id'] for date in data['structure']['dimensions']['observation'][0]['values']]

   data_dict[maturity] = pd.Series(values, index=pd.to_datetime(dates))

df = pd.DataFrame(data_dict)
df.columns = ['3M', '1Y', '2Y', '5Y', '10Y', '30Y']

b. Be sure to pick maturities ranging from short-term to long-term (e.g. 6 month maturity to 20 or 30 year maturities).

Answer

In [ ]:
df
Out[ ]:
3M 1Y 2Y 5Y 10Y 30Y
2020-01-02 -0.623324 -0.634844 -0.622180 -0.478606 -0.173995 0.358465
2020-01-03 -0.595449 -0.630324 -0.638365 -0.520394 -0.222926 0.306064
2020-01-06 -0.597296 -0.637580 -0.651086 -0.541757 -0.247426 0.283818
2020-01-07 -0.605300 -0.641443 -0.650553 -0.532864 -0.232130 0.304965
2020-01-08 -0.597340 -0.633612 -0.641188 -0.515378 -0.204580 0.341047
... ... ... ... ... ... ...
2024-12-20 2.616587 2.172129 1.960729 2.049615 2.374684 2.445430
2024-12-23 2.632250 2.201549 1.997799 2.084822 2.402344 2.456214
2024-12-24 2.661780 2.211694 2.003148 2.090155 2.405459 2.455847
2024-12-27 2.653268 2.188540 2.007697 2.126310 2.431800 2.494656
2024-12-30 2.575177 2.178646 2.011151 2.130018 2.447304 2.513773

1278 rows × 6 columns

In [ ]:
df.to_csv('yield_curves.csv')
In [ ]:
def nelson_siegel(t, params):
    beta0, beta1, beta2, tau = params
    # Avoid division by zero
    exp_term = np.exp(-t/tau)
    term = (1 - exp_term)/(t/tau)
    return beta0 + beta1 * term + beta2 * (term - exp_term)

def objective(params, t, y):
    return np.sum((nelson_siegel(t, params) - y)**2)

# Convert maturities to years
maturities = np.array([0.25, 1, 2, 5, 10, 30])

# Fit for each date
results = []
for date in df.index:
    yields = df.loc[date].values

    # Initial parameters [β₀, β₁, β₂, τ]
    initial_guess = [yields[-1], yields[0]-yields[-1], 0, 2]

    # Bounds for parameters
    bounds = [(0, 15), (-15, 15), (-15, 15), (0.1, 10)]

    res = minimize(objective, initial_guess, args=(maturities, yields),
                  bounds=bounds, method='L-BFGS-B')

    # Store parameters and fitted yields
    params = res.x
    fitted_yields = nelson_siegel(maturities, params)
    rmse = np.sqrt(np.mean((yields - fitted_yields)**2))

    results.append({
        'date': date,
        'beta0': params[0],  # Level
        'beta1': params[1],  # Slope
        'beta2': params[2],  # Curvature
        'tau': params[3],    # Decay
        'rmse': rmse
    })

# Create DataFrame of results
results_df = pd.DataFrame(results).set_index('date')
In [ ]:
results_df
Out[ ]:
beta0 beta1 beta2 tau rmse
date
2020-01-02 0.698999 -1.310728 -1.646603 3.455619 0.001258
2020-01-03 0.640179 -1.215057 -1.799441 3.325815 0.000900
2020-01-06 0.620464 -1.195045 -1.839001 3.329411 0.000919
2020-01-07 0.644782 -1.228894 -1.830838 3.332548 0.000943
2020-01-08 0.683263 -1.259286 -1.862525 3.289445 0.000826
... ... ... ... ... ...
2024-12-20 2.569276 0.315362 -2.540065 1.188862 0.035549
2024-12-23 2.576592 0.318141 -2.435052 1.169326 0.037266
2024-12-24 2.575070 0.360573 -2.474106 1.149855 0.037276
2024-12-27 2.607960 0.329663 -2.519298 1.095611 0.032313
2024-12-30 2.632903 0.187122 -2.373655 1.157329 0.034248

1278 rows × 5 columns

c. Fit a Nelson-Siegel model.

Answer

In [ ]:
sns.set()

# Read data
df = pd.read_csv('yield_curves.csv', index_col=0)
df.index = pd.to_datetime(df.index)

# Nelson-Siegel function
def nelson_siegel(t, params):
    beta0, beta1, beta2, tau = params
    exp_term = np.exp(-t/tau)
    return beta0 + beta1 * ((1 - exp_term)/(t/tau)) + beta2 * ((1 - exp_term)/(t/tau) - exp_term)

# Objective function
def objective(params, maturities, rates):
    return np.sum((nelson_siegel(maturities, params) - rates)**2)

# Convert maturities to years
maturities = np.array([0.25, 1, 2, 5, 10, 30])
latest_rates = df.iloc[-1].values

# Fit Nelson-Siegel model
initial_guess = [3, -2, 2, 1]  # [beta0, beta1, beta2, tau]
result = minimize(objective, initial_guess, args=(maturities, latest_rates))
fitted_params = result.x

# Generate smooth curve
t_points = np.linspace(0.1, 30, 100)
fitted_curve = nelson_siegel(t_points, fitted_params)

# Plot
plt.figure(figsize=(10, 6))
sns.set_style('whitegrid')
sns.scatterplot(x=maturities, y=latest_rates, color='blue', s=100, label='Actual Yields')
plt.plot(t_points, fitted_curve, 'r-', label='Nelson-Siegel Fit')
plt.xlabel('Maturity (years)')
plt.ylabel('Yield (%)')
plt.title(f'Figure 2.1: Nelson-Siegel Yield Curve Fit ({df.index[-1].strftime("%Y-%m-%d")})')
plt.legend()
plt.grid(True)

print("Fitted Parameters:")
print(f"β0 (level): {fitted_params[0]:.4f}")
print(f"β1 (slope): {fitted_params[1]:.4f}")
print(f"β2 (curvature): {fitted_params[2]:.4f}")
print(f"τ (decay): {fitted_params[3]:.4f}")

plt.show()
Fitted Parameters:
β0 (level): 2.0494
β1 (slope): -0.3579
β2 (curvature): 1.3981
τ (decay): 0.2822
No description has been provided for this image

d. Fit a Cubic-Spline model.

Answer

In [ ]:
cs = CubicSpline(maturities, latest_rates)

# Generate smooth curve
t_points = np.linspace(0.1, 30, 100)
fitted_curve = cs(t_points)

# Plot
plt.figure(figsize=(10, 6))
sns.set_style('whitegrid')
sns.scatterplot(x=maturities, y=latest_rates, color='blue', s=100, label='Actual Yields')
plt.plot(t_points, fitted_curve, 'r-', label='Cubic Spline Fit')
plt.xlabel('Maturity (years)')
plt.ylabel('Yield (%)')
plt.title(f'Figure 2.2: Cubic Spline Yield Curve Fit ({df.index[-1].strftime("%Y-%m-%d")})')
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image

e. Compare the models in terms of 1) fit and 2) interpretation

Answer

In analyzing the Nelson-Siegel (Figure 2.1) and cubic models (Figure 2.2) for interest rate term structure, both models offer distinct advantages. The Nelson-Siegel model uses three parameters to shape the yield curve, making it flexible and straightforward to interpret (Wahlstrøm et al. 971). This model excels in its simplicity and ability to clearly represent economic factors like long-term rates, curve steepness, and mid-maturity humps (Akinyemi et al. 14). The Nelson-Siegel model produces a relatively smooth curve that captures the general trend but misses some key data points. The model parameters shown ($\beta_{0}$=2.6494, $\beta_{1}$=-0.2579, $\beta_{2}$=1.3981, $\tau$=0.2822) generate a curve that notably underestimates yields at both very short and long maturities. The misfit is particularly visible at the 30-year point where the actual yield is around 2.5% but the fitted curve predicts closer to 2.1%.

Conversely, the cubic model employs a third-degree polynomial to capture complex, non-linear data trends. Although it provides a more precise fit for intricate patterns, interpreting its coefficients demands careful analysis. Overall, the Nelson-Siegel model is favored for its ease of use and clarity, while the cubic model is better suited for detailed data fitting but requires deeper examination of its terms (Nymand 43). Cubic Spline model produces a more flexible fit that passes much closer to the observed data points. It captures both the short-term dynamics and the long-term behavior more accurately. The curve shows more local variation, particularly evident in how it models the "belly" of the yield curve between 5 and 15 years.

f. Be sure to specify at the levels of model parameters (ex. Alpha1).

Answer

The Nelson-Siegel model offers clear economic interpretation of its parameters. $\beta_{0}$ (2.6494) represents the long-term interest rate level, $\beta_{1}$ (-0.2579) controls the slope of the yield curve, and $\beta_{2}$ (1.3981) determines the curvature. The decay parameter $\tau$ (0.2822) governs how quickly the short-term components decay. These parameters have direct economic meaning and can be used to understand market expectations about future interest rates. The Cubic Spline model, while providing better fit, lacks such clear economic interpretation. Its parameters are purely mathematical, representing the coefficients of cubic polynomials between knot points. This makes it harder to extract meaningful economic insights about market expectations or risk premiums from the model parameters.

g. In Module 2 Lesson 4 (‘Smoothing Data’), we said smoothing data can be unethical. If Nelson-Siegel is smoothing the yield curve, is this considered unethical? Why or why not?

Answer

Using smoothing techniques like the Nelson-Siegel method to estimate yield curves is typically ethical when the goal is to improve the accuracy of models and better understand market trends. These methods help to remove noise and clarify long-term data patterns. However, it crosses into unethical territory if used to mislead stakeholders or distort data, such as by inflating asset values or hiding volatility. The key to ethical use lies in being transparent about the methodology and intentions. Problems arise when the smoothing process is used to misrepresent the true nature of the data (Copeland 103).

3. Exploiting Correlation¶

Financial Data is meant not only to process data but to understand how meaningful factors can be used to summarize or represent the data. Let’s understand the role that correlation and principal components play.

a) Generate 5 uncorrelated Gaussian random variables that simulate yield changes (they can be positive or negative with a mean close to 0 and a standard deviation that is small).

Answer¶

Table of first five values generated

In [ ]:
np.random.seed(432)
num_samples = 100
num_variables = 5
mean=0
std_deviation=0.01
data=np.random.normal(loc=mean,scale=std_deviation, size=(num_samples,num_variables))
dataF=pd.DataFrame(data)
dataF.columns=["Variable 1","Variable 2","Variable 3","Variable 4","Variable 5"]
dataF.head()
Out[ ]:
Variable 1 Variable 2 Variable 3 Variable 4 Variable 5
0 0.001869 0.000345 -0.003848 0.004646 0.006009
1 -0.017359 -0.003120 0.003078 -0.002261 -0.000731
2 0.000010 0.003327 0.019509 -0.007947 0.006518
3 -0.000422 -0.003114 0.000570 0.003264 -0.003010
4 0.002769 0.005642 0.005877 0.016587 -0.014441

Correlation matrix

In [ ]:
matrix_corr = dataF.corr()
print(matrix_corr)
            Variable 1  Variable 2  Variable 3  Variable 4  Variable 5
Variable 1    1.000000   -0.008919    0.039846    0.013316    0.202243
Variable 2   -0.008919    1.000000    0.121684    0.155779   -0.133540
Variable 3    0.039846    0.121684    1.000000   -0.130289   -0.111868
Variable 4    0.013316    0.155779   -0.130289    1.000000   -0.092115
Variable 5    0.202243   -0.133540   -0.111868   -0.092115    1.000000

Correloplot

In [ ]:
corr = dataF.corr()

f, ax =plt.subplots(figsize=(9, 6))
mask=np.triu(np.ones_like(corr, dtype=bool))
cmap=sns.diverging_palette(230, 20, n=256, as_cmap=True)
sns.heatmap(dataF.corr(),
            mask=mask,
            cmap=cmap,
            vmax=1,
            vmin = -.25,
            center=0,
            square=True,
            linewidths=.5,
            annot = True,
            fmt='.3f',
            annot_kws={'size': 16},
            cbar_kws={"shrink": .75})

plt.title('Figure 3.1: Correloplot')
plt.tight_layout()
ax.tick_params(axis = 'x', labelsize = 14)
ax.tick_params(axis = 'y', labelsize = 14)
ax.set_ylim(len(corr))

plt.show()
No description has been provided for this image

As we see in the correleplot (Figure 3.1) all correlation values a near zero, which aline to what exercise need, 5 uncorrelated exercises.

b) Run a Principal Components using EITHER the correlation OR covariance matrix.

Answer

Eigenvalues

In [ ]:
# Calculating eigenvectors and eigenvalues of the covariance matrix of simulated dataset
dataF_cov = dataF.cov()
eigenvalues, eigenvectors = LA.eig(dataF_cov)
sorted_indices=np.argsort(eigenvalues)[::-1]
eigenvectors=eigenvectors[:,sorted_indices]
eigenvalues=eigenvalues[sorted_indices]
eigenvalues
Out[ ]:
array([1.27782260e-04, 1.14523588e-04, 9.47725977e-05, 7.27875167e-05,
       6.21821048e-05])

Eigenvectors

In [ ]:
eigenvectors
Out[ ]:
array([[ 0.1038149 , -0.21881596, -0.68515343, -0.26798617, -0.63252686],
       [-0.49630891,  0.41260863, -0.40928584,  0.64272115, -0.053162  ],
       [-0.74956086, -0.50816345, -0.1023671 , -0.29393446,  0.28818733],
       [-0.01974409,  0.64348683, -0.31144041, -0.59654411,  0.3642456 ],
       [ 0.42505132, -0.33101102, -0.50554497,  0.26987143,  0.61754091]])

Principal components

In [ ]:
principal_components =dataF.dot(eigenvectors)
principal_components.columns = ["PC_1","PC_2","PC_3","PC_4","PC_5"]
principal_components.head()
Out[ ]:
PC_1 PC_2 PC_3 PC_4 PC_5
0 0.005369 0.002689 -0.005512 -0.000298 0.003094
1 -0.002826 -0.000266 0.013929 0.002893 0.010758
2 -0.013346 -0.015815 -0.004186 0.002901 0.006569
3 -0.000269 0.001614 0.002010 -0.004816 -0.000073
4 -0.013384 0.014189 -0.002674 -0.012635 -0.003234

c) Write a paragraph explaining how the variances of each component compare with each other. In this paragraph, you will address the following question: how much variance is explained by Component 1, Component 2, Component 3?

Answer

In [ ]:
# Put data into a DataFrame
df_eigval = pd.DataFrame({"Eigenvalues":eigenvalues}, index=range(1,6))

# Work out explained proportion
df_eigval["Explained variance proportion"] = df_eigval["Eigenvalues"] / np.sum(df_eigval["Eigenvalues"])
df_eigval["Cumulative variance proportion"] = df_eigval["Explained variance proportion"].cumsum()
#Format as percentage
df_eigval.style.format({"Explained variance proportion": "{:.2%}",
    "Cumulative variance proportion": "{:.2%}"})
Out[ ]:
  Eigenvalues Explained variance proportion Cumulative variance proportion
1 0.000128 27.07% 27.07%
2 0.000115 24.26% 51.33%
3 0.000095 20.08% 71.41%
4 0.000073 15.42% 86.83%
5 0.000062 13.17% 100.00%

The component 1, component 2 and component 3 explain 27.07%, 24.26% and 20.08%, respectively, of the total variability of our dataset. Cumulatively, the first three components explain 71.41% of the total data variability.

d. Produce a screeplot (see https://en.wikipedia.org/wiki/Scree_plot) of the variance explained for each component.

Answer

In [ ]:
plt.figure(figsize=(8, 5))
plt.plot(range(1,len(eigenvalues)+1), eigenvalues, marker="o",linestyle="--")
plt.title("Figure 3.2: Scree Plot")
plt.xlabel("Principal Components")
plt.ylabel("Eigenvalue")
plt.xticks(range(1,len(eigenvalues)+1))
plt.grid(True)
No description has been provided for this image

e. Collect the daily closing yields for 5 government securities, say over 6 months.

Answer

In [ ]:
api_key = "ec67a401ae513935b544843d8c308536"  # Alex API
series_ids = [
    'DGS1',
    'DGS2',
    'DGS5',
    'DGS10',
    'DGS30'
]

base_url = "https://api.stlouisfed.org/fred/series/observations"

# Date range for the past 7 months
end_date = datetime.now().date()
start_date = end_date - timedelta(days=7*30)  # 7 months ago

# Function to fetch yield data
def get_yield_data(series_id, start_date, end_date):
    params = {
        "series_id": series_id,
        "api_key": api_key,
        "file_type": "json",
        "observation_start": str(start_date),
        "observation_end": str(end_date),
    }

    response = requests.get(base_url, params=params)
    if response.status_code == 200:
        res_data = response.json()
        if 'observations' in res_data:
            obs_data = pd.DataFrame(res_data['observations'])
            obs_data['date'] = pd.to_datetime(obs_data['date'])
            obs_data.set_index('date', inplace=True)
            obs_data['value'] = pd.to_numeric(obs_data['value'], errors='coerce')
            return obs_data
    return None

# Collect yield data for each series
yields_dict = {}
for series_id in series_ids:
    yields_dict[series_id] = get_yield_data(series_id, start_date, end_date)

# Combine data into a single DataFrame
yields_df = pd.DataFrame({series_id: data['value'] for series_id, data in yields_dict.items() if data is not None})
yields_df.columns = ['1 Year', '2 Year', '5 Year', '10 Year', '30 Year']
# Compute daily yield changes using .diff()
daily_yield_changes = yields_df.diff()
daily_yield_changes.dropna(inplace=True)
print(daily_yield_changes)
            1 Year  2 Year  5 Year  10 Year  30 Year
date                                                
2024-06-25    0.00   -0.06   -0.02    -0.02    -0.02
2024-06-26    0.03    0.06    0.07     0.09     0.09
2024-06-27   -0.03   -0.01   -0.03    -0.03    -0.02
2024-06-28   -0.01    0.01    0.04     0.07     0.08
2024-07-01    0.01    0.06    0.11     0.12     0.13
...            ...     ...     ...      ...      ...
2025-01-10    0.09    0.13    0.13     0.09     0.04
2025-01-13   -0.01    0.00    0.02     0.02     0.01
2025-01-14   -0.02   -0.03   -0.02    -0.01     0.01
2025-01-15   -0.03   -0.10   -0.14    -0.12    -0.10
2025-01-16   -0.01   -0.04   -0.06    -0.05    -0.04

[134 rows x 5 columns]

f. Be sure to compute the daily yield changes!

In [ ]:
plt.figure(figsize=(12, 6))
for column in daily_yield_changes.columns:
    plt.plot(daily_yield_changes.index, daily_yield_changes[column], label=column)

plt.title('Figure 3.3: Daily Yield Changes for U.S. Treasury Securities')
plt.xlabel('Date')
plt.ylabel('Daily Yield Change (%)')
plt.legend(title="Maturities")
plt.grid(True)
plt.show()
No description has been provided for this image

g. Re-run the Principal Components using EITHER the correlation or covariance matrix.

Answer of question g

Eigenvalues

In [ ]:
yields_df_cov = daily_yield_changes.cov()
eigenvalues, eigenvectors = LA.eig(yields_df_cov)
sorted_indices=np.argsort(eigenvalues)[::-1]
eigenvectors=eigenvectors[:,sorted_indices]
eigenvalues=eigenvalues[sorted_indices]
eigenvalues
Out[ ]:
array([1.27432534e-02, 1.86871188e-03, 2.76677987e-04, 1.32902058e-04,
       3.25563762e-05])

Eigenvectors

In [ ]:
eigenvectors
Out[ ]:
array([[-0.366009  , -0.535143  , -0.73188899,  0.20908883, -0.01672587],
       [-0.49679164, -0.45651126,  0.40733949, -0.61471393, -0.03157313],
       [-0.50391522,  0.02895065,  0.40052634,  0.62880218,  0.43522153],
       [-0.45748966,  0.36706634,  0.02217696,  0.15266311, -0.79508973],
       [-0.39497866,  0.60798323, -0.37080976, -0.39963788,  0.42087758]])

Principal components

In [ ]:
principal_components =yields_df.dot(eigenvectors)
principal_components.columns = ["PC_1","PC_2","PC_3","PC_4","PC_5"]
principal_components.dropna(inplace=True)
principal_components.head()
Out[ ]:
PC_1 PC_2 PC_3 PC_4 PC_5
date
2024-06-24 -10.032590 -0.532780 -1.633712 -0.245560 0.088697
2024-06-25 -9.975655 -0.525469 -1.659190 -0.216314 0.089371
2024-06-26 -10.128439 -0.479133 -1.660047 -0.225135 0.083761
2024-06-27 -10.075749 -0.482554 -1.647428 -0.240712 0.086957
2024-06-28 -10.160836 -0.406276 -1.648127 -0.245083 0.082232

h. How do the variances of each component compare? In other words, how much variance is explained by Component 1, Component 2, Component 3, etc.?

In [ ]:
# Put data into a DataFrame
df_eigval = pd.DataFrame({"Eigenvalues":eigenvalues}, index=range(1,6))

# Work out explained proportion
df_eigval["Explained variance proportion"] = df_eigval["Eigenvalues"] / np.sum(df_eigval["Eigenvalues"])
df_eigval["Cumulative variance proportion"] = df_eigval["Explained variance proportion"].cumsum()
#Format as percentage
df_eigval.style.format({"Explained variance proportion": "{:.2%}",
    "Cumulative variance proportion": "{:.2%}"})
Out[ ]:
  Eigenvalues Explained variance proportion Cumulative variance proportion
1 0.012743 84.65% 84.65%
2 0.001869 12.41% 97.06%
3 0.000277 1.84% 98.90%
4 0.000133 0.88% 99.78%
5 0.000033 0.22% 100.00%

The component 1, component 2, component 3, component 4 and component 5 explain 84.31%, 12.76%, 1.81%, 0.89% and 0.23%, respectively, of the total variability of our dataset. Cumulatively, the first two components explain 97.07% of the total data variability.

i. Produce a screeplot of the variance explained for each component.

In [ ]:
plt.figure(figsize=(8,5))
plt.plot(range(1,len(eigenvalues)+1), eigenvalues,marker="o",linestyle="--")
plt.title("Figure 3.4: Scree Plot")
plt.xlabel("Principal Components")
plt.ylabel("Eigenvalue")
plt.xticks(range(1,len(eigenvalues)+1))
plt.grid(True)
No description has been provided for this image

j. How does the screeplot from the uncorrelated data compare with the screeplot from the government data?

Answer

In the screeplot of uncorrelated data (Figure 3.2), we observe a gradual decline in the dashed line, indicating the need for many principal components to achieve a high proportion of explained variance. Conversely, in the screeplot of government data (Figure 3.4), the dashed line shows a more pronounced decline. This suggests that in correlated data, the first few principal components explain a larger proportion of variability compared to the later components.

4. Empirical Analysis of ETFs Pick a sector ETF (in the US, for example, XLRE)¶

a. Find the 30 largest holdings.

Answer

The 30 largest holdings of XLI Industrials are

In [ ]:
keys = ['XLI']
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0"}
def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        for key in keys:
            r = req.get(url.format(key))
            #print(f"Extracting: {r.url}")
            goal = re.findall(r'etf\\\/(.*?)\\', r.text)
            print(goal[:30])
main("https://www.zacks.com/funds/etf/{}/holding")
['GE', 'CAT', 'RTX', 'HON', 'UNP', 'BA', 'ETN', 'UBER', 'ADP', 'DE', 'LMT', 'UPS', 'GEV', 'TT', 'PH', 'WM', 'TDG', 'EMR', 'MMM', 'ITW', 'GD', 'NOC', 'FDX', 'CTAS', 'CSX', 'CARR', 'PCAR', 'JCI', 'NSC', 'CPRT']

b. Get at least 6 months of data (~ 120 data points).

Answer

130 recent data points of XLI are

In [ ]:
xlf = yfin.download(['GE', 'CAT', 'RTX', 'UBER', 'ETN', 'HON', 'UNP', 'ADP', 'DE', 'BA', 'LMT', 'GEV', 'UPS', 'TT', 'PH', 'WM', 'EMR', 'CTAS', 'ITW', 'TDG', 'MMM', 'GD', 'CSX', 'NOC', 'FDX', 'CARR', 'PCAR', 'NSC', 'URI', 'JCI'])['Close']
xlf.index = pd.to_datetime(xlf.index).strftime("%Y-%m-%d")
xlf=xlf.dropna().iloc[-130:]
xlf
[*********************100%***********************]  30 of 30 completed
Out[ ]:
Ticker ADP BA CARR CAT CSX CTAS DE EMR ETN FDX ... PCAR PH RTX TDG TT UBER UNP UPS URI WM
Date
2024-07-15 240.175369 179.110001 66.957771 343.134705 34.364563 178.026718 368.950012 115.904816 325.018585 302.040466 ... 101.046936 541.951538 100.543625 1176.302490 340.164673 72.290001 233.394363 140.607178 704.445190 214.500183
2024-07-16 243.323074 186.050003 69.462715 357.831818 35.089600 180.666565 374.747284 117.787514 330.415192 310.457428 ... 105.241524 565.768982 102.581795 1212.824585 345.471649 74.300003 240.386597 143.709030 739.896057 219.356216
2024-07-17 246.718231 184.839996 67.176453 355.072998 35.208782 179.254501 382.311493 117.173164 309.584106 307.397614 ... 105.550797 555.083984 103.877922 1175.857300 334.618286 68.620003 241.207458 143.757812 720.557434 221.769363
2024-07-18 244.887009 180.229996 66.649628 353.316528 34.781704 189.014389 377.854370 115.478729 307.169037 303.624847 ... 106.294998 544.866577 102.740097 1180.072144 329.560699 66.260002 240.337143 141.202194 725.397095 221.550888
2024-07-19 244.461380 179.669998 65.844460 344.980530 34.453949 188.912292 375.293243 114.626556 309.971710 303.317871 ... 104.197701 541.862000 101.721008 1172.636963 329.331299 67.309998 239.654755 141.611862 715.628174 221.252960
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2025-01-13 290.200012 170.570007 67.110001 362.500000 31.889999 190.600006 429.910004 118.470001 340.880005 274.589996 ... 108.160004 637.989990 117.739998 1278.459961 376.980011 65.699997 226.759995 125.010002 689.150024 207.529999
2025-01-14 291.690002 167.020004 68.400002 371.570007 32.139999 192.279999 432.309998 119.790001 340.140015 277.619995 ... 110.199997 647.010010 119.470001 1300.729980 382.709991 64.839996 229.380005 125.839996 729.859985 207.169998
2025-01-15 293.369995 166.199997 68.800003 374.890015 32.459999 195.699997 428.880005 120.720001 342.579987 276.589996 ... 109.629997 656.320007 119.089996 1293.530029 385.549988 67.019997 230.860001 127.470001 741.489990 208.000000
2025-01-16 296.230011 168.929993 69.739998 380.549988 32.970001 198.050003 439.109985 123.290001 345.190002 277.369995 ... 109.480003 658.659973 120.459999 1322.020020 388.859985 68.580002 234.889999 129.940002 757.419983 211.009995
2025-01-17 296.179993 171.089996 69.660004 386.019989 32.730000 198.309998 455.440002 124.529999 346.279999 275.100006 ... 110.330002 669.460022 121.110001 1340.020020 389.190002 67.339996 234.259995 129.729996 771.000000 211.710007

130 rows × 30 columns

c. Compute the daily returns.

Answer

The daily returns are

In [ ]:
xlf_returns=xlf.dropna().pct_change().dropna()
xlf_returns
Out[ ]:
Ticker ADP BA CARR CAT CSX CTAS DE EMR ETN FDX ... PCAR PH RTX TDG TT UBER UNP UPS URI WM
Date
2024-07-16 0.013106 0.038747 0.037411 0.042832 0.021098 0.014828 0.015713 0.016243 0.016604 0.027867 ... 0.041511 0.043948 0.020271 0.031048 0.015601 0.027805 0.029959 0.022060 0.050325 0.022639
2024-07-17 0.013953 -0.006504 -0.032914 -0.007710 0.003397 -0.007816 0.020185 -0.005216 -0.063045 -0.009856 ... 0.002939 -0.018886 0.012635 -0.030480 -0.031416 -0.076447 0.003415 0.000339 -0.026137 0.011001
2024-07-18 -0.007422 -0.024940 -0.007842 -0.004947 -0.012130 0.054447 -0.011658 -0.014461 -0.007801 -0.012273 ... 0.007051 -0.018407 -0.010953 0.003584 -0.015114 -0.034392 -0.003608 -0.017777 0.006717 -0.000985
2024-07-19 -0.001738 -0.003107 -0.012081 -0.023594 -0.009423 -0.000540 -0.006778 -0.007379 0.009124 -0.001011 ... -0.019731 -0.005514 -0.009919 -0.006301 -0.000696 0.015847 -0.002839 0.002901 -0.013467 -0.001345
2024-07-22 0.004130 -0.004286 0.026570 0.004766 -0.005189 0.017784 0.003148 0.018586 0.027029 0.003656 ... 0.011595 0.013715 0.009338 0.014361 0.021082 0.006240 0.003920 0.000000 0.027427 0.002917
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2025-01-13 0.011679 -0.008314 0.009021 0.032764 0.003461 0.006867 0.052231 0.004494 -0.001669 0.014258 ... 0.014825 0.012618 0.022848 0.012954 -0.000371 -0.004093 0.006882 0.014033 0.027141 0.001351
2025-01-14 0.005134 -0.020813 0.019222 0.025021 0.007839 0.008814 0.005583 0.011142 -0.002171 0.011035 ... 0.018861 0.014138 0.014693 0.017419 0.015200 -0.013090 0.011554 0.006639 0.059073 -0.001735
2025-01-15 0.005760 -0.004910 0.005848 0.008935 0.009956 0.017787 -0.007934 0.007764 0.007173 -0.003710 ... -0.005172 0.014389 -0.003181 -0.005535 0.007421 0.033621 0.006452 0.012953 0.015935 0.004006
2025-01-16 0.009749 0.016426 0.013663 0.015098 0.015712 0.012008 0.023853 0.021289 0.007619 0.002820 ... -0.001368 0.003565 0.011504 0.022025 0.008585 0.023277 0.017456 0.019377 0.021484 0.014471
2025-01-17 -0.000169 0.012786 -0.001147 0.014374 -0.007279 0.001313 0.037189 0.010058 0.003158 -0.008184 ... 0.007764 0.016397 0.005396 0.013616 0.000849 -0.018081 -0.002682 -0.001616 0.017929 0.003317

129 rows × 30 columns

d. Compute the covariance matrix.

Answer

The covariance matrix is

In [ ]:
xlf_cov_matrix=xlf_returns.cov()
xlf_cov_matrix
Out[ ]:
Ticker ADP BA CARR CAT CSX CTAS DE EMR ETN FDX ... PCAR PH RTX TDG TT UBER UNP UPS URI WM
Ticker
ADP 0.000095 0.000023 0.000061 0.000074 0.000085 0.000066 0.000048 0.000079 0.000073 0.000050 ... 0.000078 0.000089 0.000061 0.000069 0.000040 0.000019 0.000070 0.000030 0.000098 0.000042
BA 0.000023 0.000459 0.000100 0.000094 0.000006 -0.000025 0.000086 0.000104 0.000119 0.000056 ... -0.000009 0.000104 0.000065 0.000122 0.000053 0.000130 0.000028 0.000026 0.000106 -0.000019
CARR 0.000061 0.000100 0.000312 0.000167 0.000109 0.000096 0.000098 0.000176 0.000216 0.000095 ... 0.000132 0.000184 0.000062 0.000154 0.000172 0.000160 0.000130 0.000046 0.000231 0.000062
CAT 0.000074 0.000094 0.000167 0.000325 0.000150 0.000077 0.000179 0.000226 0.000211 0.000124 ... 0.000185 0.000245 0.000078 0.000126 0.000098 0.000172 0.000125 0.000096 0.000303 0.000014
CSX 0.000085 0.000006 0.000109 0.000150 0.000253 0.000088 0.000093 0.000145 0.000103 0.000064 ... 0.000163 0.000167 0.000068 0.000092 0.000047 0.000112 0.000158 0.000081 0.000186 0.000045
CTAS 0.000066 -0.000025 0.000096 0.000077 0.000088 0.000269 0.000020 0.000083 0.000109 0.000035 ... 0.000106 0.000100 0.000048 0.000082 0.000087 0.000050 0.000069 0.000049 0.000142 0.000070
DE 0.000048 0.000086 0.000098 0.000179 0.000093 0.000020 0.000287 0.000106 0.000079 0.000100 ... 0.000123 0.000113 0.000059 0.000020 0.000016 0.000072 0.000077 0.000102 0.000169 -0.000004
EMR 0.000079 0.000104 0.000176 0.000226 0.000145 0.000083 0.000106 0.000327 0.000215 0.000121 ... 0.000162 0.000239 0.000062 0.000152 0.000126 0.000108 0.000112 0.000087 0.000259 0.000032
ETN 0.000073 0.000119 0.000216 0.000211 0.000103 0.000109 0.000079 0.000215 0.000381 0.000089 ... 0.000154 0.000243 0.000044 0.000186 0.000226 0.000220 0.000082 0.000052 0.000269 0.000066
FDX 0.000050 0.000056 0.000095 0.000124 0.000064 0.000035 0.000100 0.000121 0.000089 0.000344 ... 0.000104 0.000097 0.000044 0.000035 0.000053 0.000077 0.000093 0.000143 0.000135 0.000038
GD 0.000071 0.000066 0.000091 0.000099 0.000088 0.000083 0.000047 0.000091 0.000109 0.000030 ... 0.000076 0.000114 0.000104 0.000134 0.000062 0.000066 0.000076 0.000019 0.000138 0.000035
GE 0.000068 0.000106 0.000183 0.000149 0.000088 0.000078 0.000053 0.000149 0.000246 0.000051 ... 0.000084 0.000186 0.000098 0.000225 0.000193 0.000193 0.000075 -0.000014 0.000228 0.000036
GEV 0.000076 0.000127 0.000183 0.000227 0.000103 0.000065 0.000068 0.000247 0.000412 0.000099 ... 0.000167 0.000299 0.000064 0.000241 0.000271 0.000216 0.000079 0.000073 0.000285 0.000063
HON 0.000053 0.000051 0.000104 0.000068 0.000074 0.000056 0.000025 0.000084 0.000095 0.000052 ... 0.000067 0.000075 0.000020 0.000090 0.000072 0.000055 0.000078 0.000026 0.000077 0.000081
ITW 0.000059 0.000044 0.000095 0.000121 0.000089 0.000046 0.000088 0.000107 0.000076 0.000081 ... 0.000090 0.000113 0.000054 0.000060 0.000037 0.000061 0.000075 0.000062 0.000135 0.000015
JCI 0.000086 0.000056 0.000179 0.000190 0.000125 0.000110 0.000083 0.000191 0.000237 0.000111 ... 0.000163 0.000211 0.000063 0.000158 0.000151 0.000127 0.000110 0.000085 0.000248 0.000063
LMT 0.000045 0.000023 0.000026 0.000010 0.000025 0.000030 0.000017 0.000008 0.000003 -0.000017 ... 0.000007 0.000031 0.000090 0.000066 0.000035 -0.000034 0.000024 -0.000047 0.000028 0.000013
MMM 0.000070 0.000057 0.000151 0.000161 0.000125 0.000073 0.000082 0.000153 0.000119 0.000067 ... 0.000120 0.000155 0.000059 0.000086 0.000101 -0.000022 0.000120 0.000067 0.000155 0.000010
NOC 0.000048 0.000017 0.000009 0.000042 0.000054 0.000031 0.000046 0.000034 0.000007 0.000012 ... 0.000028 0.000051 0.000112 0.000057 0.000013 -0.000043 0.000031 0.000009 0.000068 -0.000019
NSC 0.000096 0.000042 0.000130 0.000174 0.000223 0.000112 0.000094 0.000159 0.000124 0.000100 ... 0.000146 0.000161 0.000076 0.000086 0.000061 0.000089 0.000176 0.000099 0.000201 0.000038
PCAR 0.000078 -0.000009 0.000132 0.000185 0.000163 0.000106 0.000123 0.000162 0.000154 0.000104 ... 0.000350 0.000170 0.000052 0.000090 0.000082 0.000086 0.000121 0.000168 0.000191 0.000074
PH 0.000089 0.000104 0.000184 0.000245 0.000167 0.000100 0.000113 0.000239 0.000243 0.000097 ... 0.000170 0.000334 0.000093 0.000159 0.000133 0.000162 0.000126 0.000070 0.000305 0.000029
RTX 0.000061 0.000065 0.000062 0.000078 0.000068 0.000048 0.000059 0.000062 0.000044 0.000044 ... 0.000052 0.000093 0.000168 0.000088 0.000038 -0.000014 0.000056 0.000010 0.000131 -0.000015
TDG 0.000069 0.000122 0.000154 0.000126 0.000092 0.000082 0.000020 0.000152 0.000186 0.000035 ... 0.000090 0.000159 0.000088 0.000299 0.000151 0.000160 0.000089 0.000029 0.000186 0.000050
TT 0.000040 0.000053 0.000172 0.000098 0.000047 0.000087 0.000016 0.000126 0.000226 0.000053 ... 0.000082 0.000133 0.000038 0.000151 0.000221 0.000132 0.000050 0.000018 0.000142 0.000069
UBER 0.000019 0.000130 0.000160 0.000172 0.000112 0.000050 0.000072 0.000108 0.000220 0.000077 ... 0.000086 0.000162 -0.000014 0.000160 0.000132 0.000739 0.000087 0.000038 0.000187 0.000033
UNP 0.000070 0.000028 0.000130 0.000125 0.000158 0.000069 0.000077 0.000112 0.000082 0.000093 ... 0.000121 0.000126 0.000056 0.000089 0.000050 0.000087 0.000172 0.000078 0.000142 0.000051
UPS 0.000030 0.000026 0.000046 0.000096 0.000081 0.000049 0.000102 0.000087 0.000052 0.000143 ... 0.000168 0.000070 0.000010 0.000029 0.000018 0.000038 0.000078 0.000286 0.000062 0.000038
URI 0.000098 0.000106 0.000231 0.000303 0.000186 0.000142 0.000169 0.000259 0.000269 0.000135 ... 0.000191 0.000305 0.000131 0.000186 0.000142 0.000187 0.000142 0.000062 0.000494 0.000018
WM 0.000042 -0.000019 0.000062 0.000014 0.000045 0.000070 -0.000004 0.000032 0.000066 0.000038 ... 0.000074 0.000029 -0.000015 0.000050 0.000069 0.000033 0.000051 0.000038 0.000018 0.000162

30 rows × 30 columns

e. Compute the PCA.

Answer

The principal component analysis(PCA) is

In [ ]:
xlf_returns_mean=xlf_returns.mean()

xlf_returns_std=xlf_returns.std()

xlf_standardized_data=(xlf_returns-xlf_returns_mean)/xlf_returns_std

xlf_standardized_cov=xlf_standardized_data.cov()

eigenvalues, eigenvectors = la.eig(xlf_standardized_cov)
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

xlf_principal_components= xlf_standardized_data.dot(eigenvectors)
xlf_principal_components.columns=["PC_1","PC_2","PC_3","PC_4","PC_5","PC_6","PC_7","PC_8","PC_9","PC_10","PC_11","PC_12","PC_13","PC_14","PC_15","PC_16","PC_17","PC_18","PC_19","PC_20","PC_21","PC_22","PC_23","PC_24","PC_25","PC_26","PC_27","PC_28","PC_29","PC_30"]
xlf_principal_components
Out[ ]:
PC_1 PC_2 PC_3 PC_4 PC_5 PC_6 PC_7 PC_8 PC_9 PC_10 ... PC_21 PC_22 PC_23 PC_24 PC_25 PC_26 PC_27 PC_28 PC_29 PC_30
Date
2024-07-16 7.924030 0.137422 0.886763 -0.059148 1.805747 0.023687 0.014664 0.273473 -0.186904 -0.061284 ... 0.500468 -0.562224 0.032079 -0.623973 0.031814 -0.131576 -0.058823 -0.023292 -0.245373 -0.536470
2024-07-17 -3.543482 4.703898 3.583108 -1.650724 1.032094 0.083364 1.125461 -1.607877 0.442119 1.205232 ... 0.174183 0.885378 -0.356334 -0.211199 0.133578 -0.097412 0.437272 -0.542359 0.474256 -0.061649
2024-07-18 -2.319138 0.407285 -0.737715 -1.692023 -0.957578 -0.698202 -2.062343 -0.898443 -1.145080 -1.306339 ... -0.475764 -0.436102 0.476509 0.148484 0.000695 -0.065869 -0.973560 0.277110 0.003407 0.084452
2024-07-19 -2.762167 -1.247847 -0.884262 -0.052278 -0.624147 -0.217254 0.109672 1.256096 0.010304 -0.435093 ... -0.365277 1.091306 0.653740 0.187338 0.090358 0.539398 -0.171228 -0.061050 -0.093868 0.545500
2024-07-22 3.317883 -0.784245 -1.181744 -0.277742 0.015225 -0.492282 -0.304679 -0.970278 -0.283205 -0.771144 ... 0.144122 -0.096397 -0.115491 0.320641 -0.279635 0.171495 -0.064578 -0.400981 -0.228847 0.194881
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2025-01-13 3.924230 3.535143 1.238639 0.974713 0.713740 -1.254833 0.087407 -0.531806 1.612548 -0.358643 ... -0.471550 -0.189333 -0.098235 -0.333291 0.211624 0.395580 0.303054 0.054771 -0.147809 0.153371
2025-01-14 3.769834 0.396798 -0.001358 0.626505 -1.095203 -0.868700 -0.728517 -0.375547 0.356758 -0.806851 ... 0.534097 -0.694173 -0.401121 -0.351800 -0.730760 0.445435 0.047243 -0.352723 0.636313 -0.337180
2025-01-15 1.327423 -0.907914 0.023650 -0.546966 0.050550 0.662722 -0.957914 0.446743 -0.229821 -0.579190 ... 0.876482 0.589169 -0.393185 -0.431120 -0.698458 0.675571 0.231880 -0.102541 0.530894 0.146579
2025-01-16 3.884179 0.916024 0.753182 -0.524593 1.023519 0.528138 0.448983 0.508906 -0.124216 0.053304 ... -0.302446 0.109677 0.272910 0.005086 -0.397692 0.076441 -0.242917 -0.133467 -0.007336 0.064547
2025-01-17 1.386382 0.169151 -0.446350 1.317249 -0.021273 -0.813384 0.248580 -0.869154 0.785681 0.866992 ... 0.084466 -0.051597 -0.383539 0.269699 0.253224 -0.045663 0.066251 -0.244582 -0.413982 -0.034901

129 rows × 30 columns

f. Compute the SVD.

Answer

The singular value decomposition (SVD) is

In [ ]:
standardized_returns_dvd_sqrt_n=(xlf_standardized_data/math.sqrt(len(xlf_standardized_data)-1))
U_st_return, s_st_return, VT_st_return = np.linalg.svd(standardized_returns_dvd_sqrt_n)

Matrix of left singular vectors

In [ ]:
U_st_return
Out[ ]:
array([[-0.20930198,  0.0074209 , -0.04893938, ..., -0.01761956,
        -0.09612688, -0.01893472],
       [ 0.09359602,  0.25401361, -0.19774751, ..., -0.04017135,
         0.05278976,  0.07862361],
       [ 0.06125673,  0.02199364,  0.04071362, ...,  0.05197161,
         0.06485387, -0.00574327],
       ...,
       [-0.035062  , -0.04902796, -0.00130521, ...,  0.83354267,
        -0.01081503,  0.01220279],
       [-0.10259506,  0.04946591, -0.04156722, ...,  0.00562025,
         0.86803635, -0.00854122],
       [-0.03661932,  0.00913428,  0.02463354, ...,  0.05158599,
        -0.01561285,  0.8748643 ]])

Singular values

In [ ]:
s_st_return
Out[ ]:
array([3.34632245, 1.63680109, 1.60156236, 1.36331236, 1.12608502,
       1.06081308, 1.02599675, 0.90912456, 0.86740454, 0.85508141,
       0.82863216, 0.76399463, 0.68735184, 0.67355388, 0.63943669,
       0.61713275, 0.5982534 , 0.58915346, 0.54750511, 0.52876127,
       0.52132406, 0.47621634, 0.47233618, 0.44350506, 0.42235973,
       0.38939778, 0.38502787, 0.3386785 , 0.31066052, 0.25473011])

Eigenvalues With The SVD Method

In [ ]:
s_st_return**2
Out[ ]:
array([11.19787397,  2.67911782,  2.565002  ,  1.85862058,  1.26806748,
        1.1253244 ,  1.05266932,  0.82650747,  0.75239063,  0.73116421,
        0.68663126,  0.58368779,  0.47245256,  0.45367483,  0.40887927,
        0.38085283,  0.35790713,  0.3471018 ,  0.29976185,  0.27958848,
        0.27177878,  0.226782  ,  0.22310147,  0.19669673,  0.17838774,
        0.15163063,  0.14824646,  0.11470312,  0.09650996,  0.06488743])

Matrix of the right singular vectors

In [ ]:
VT_st_return.T
Out[ ]:
array([[-2.04561657e-01,  1.98823851e-01, -5.55835898e-02,
         1.95703312e-01, -4.11442942e-02,  3.09622025e-02,
        -1.13914459e-02,  7.67302825e-02, -6.84252241e-02,
        -1.49265844e-01, -1.89567279e-01,  4.04895427e-01,
        -2.48803611e-01, -1.27740381e-01,  5.60824078e-02,
         1.61507979e-02,  3.11498403e-02,  5.65149820e-01,
        -3.30417974e-02, -1.78431265e-01, -1.79512862e-01,
        -2.65947621e-01, -5.14301642e-02,  1.66646486e-01,
         3.58342229e-02,  1.64777695e-01,  2.12864359e-02,
        -4.39880829e-02, -2.00605095e-01,  4.86899024e-02],
       [-8.75596753e-02, -1.62552367e-02,  1.33533526e-01,
        -3.02663392e-01, -3.66292488e-01, -2.22945878e-01,
        -4.22875960e-01, -5.18377077e-02, -3.43113454e-01,
        -1.59571956e-01,  3.27669925e-01,  2.00612922e-01,
         1.88456955e-01, -1.53049320e-01,  2.48060692e-01,
         1.75296532e-02,  1.50254105e-01,  5.16648371e-02,
        -7.62257950e-03,  8.66222897e-03,  9.47027828e-02,
         1.27065675e-01, -1.26985433e-01,  2.71658549e-02,
        -2.13191620e-02, -1.28303416e-01, -8.58528564e-02,
         3.54596755e-02, -9.59063920e-02,  5.96915933e-02],
       [-2.19220724e-01, -1.31178853e-01,  7.43344686e-02,
         2.67186836e-02, -7.77282159e-02, -1.35446673e-01,
        -1.00326483e-01,  2.47455411e-01,  1.76274533e-01,
         3.06843367e-01,  1.76589638e-01, -3.51063871e-01,
        -9.57551770e-02, -2.63589136e-01,  6.32013190e-02,
         2.91386908e-01,  8.88583531e-04,  5.95289127e-02,
        -1.43869588e-01, -2.21495896e-03, -9.82678472e-02,
        -6.17294744e-02, -6.88371457e-02, -1.16340932e-01,
         1.28363565e-01,  2.22119515e-01, -1.47011841e-01,
        -4.76848980e-01, -3.48505784e-02,  1.16569752e-01],
       [-2.36482430e-01, -5.79039101e-02, -1.17079216e-01,
        -2.37636866e-01,  3.07116109e-03, -1.44043290e-02,
         1.29793623e-01,  1.53783590e-01,  9.65123199e-02,
        -9.21236370e-02,  1.76502278e-02,  2.88928546e-02,
         1.67104147e-02,  1.91141091e-01, -6.75668859e-02,
        -2.03189352e-01,  1.09802083e-01, -2.56474450e-01,
        -1.47451140e-02,  2.46539573e-02, -9.82361533e-02,
        -1.04639455e-01,  1.00802558e-01,  6.10823840e-01,
         1.06869685e-01, -2.69474630e-01, -7.91429705e-02,
        -3.28869050e-01, -2.05899971e-01,  6.95707534e-02],
       [-2.06471728e-01,  8.98749208e-02, -2.25960499e-01,
         1.19540410e-01,  8.61513013e-02, -2.75552939e-01,
         1.94902731e-01, -1.98725092e-01,  8.21253027e-02,
        -1.65166193e-01, -1.08920352e-01, -5.91850276e-02,
         2.34985525e-01, -1.13265797e-01,  1.15016596e-01,
         4.94218844e-02, -1.94032047e-01,  7.41560734e-02,
         1.90880433e-01,  2.57685369e-03,  2.31067119e-01,
         2.03624124e-01, -8.27938976e-02, -1.82310381e-01,
        -1.97902441e-01, -1.77810477e-01,  1.95232447e-01,
        -3.26350305e-01, -3.64378868e-01, -1.65207948e-01],
       [-1.42384977e-01,  2.91392582e-02, -2.03576516e-03,
         3.14924169e-01,  9.27019852e-02,  1.19527056e-01,
         3.12481738e-01,  1.44507587e-01, -3.98359288e-01,
         3.92471001e-01,  3.98005685e-01,  2.15862314e-01,
         1.77596493e-01,  1.63191377e-01,  1.51629732e-01,
        -1.18180403e-01, -9.74547679e-02,  9.92343287e-02,
        -6.06864801e-02,  1.29964814e-01, -2.54336333e-02,
         1.04086912e-01,  2.13233570e-01, -5.64591867e-02,
        -9.62924340e-03, -8.30309351e-02, -1.19439927e-01,
        -5.61997351e-02, -4.73067280e-03, -3.06645544e-02],
       [-1.39310180e-01,  5.78051879e-02, -2.37619548e-01,
        -2.93482523e-01, -1.98464529e-01,  1.10003009e-01,
        -7.87432864e-04,  1.73208382e-01,  4.38547652e-01,
        -3.55654308e-02,  3.69531202e-01,  2.27597619e-01,
         8.98075476e-02, -4.71468788e-02, -2.21171049e-01,
        -1.42186259e-01, -3.48309398e-01,  7.98202468e-02,
         5.28853171e-02, -6.32855403e-02, -1.79673474e-01,
        -2.21019570e-02,  1.00818655e-01, -2.23210552e-01,
        -1.95059142e-01,  1.27706549e-02,  7.63411448e-02,
         9.05747243e-02,  1.05013195e-01,  3.57654688e-02],
       [-2.28678290e-01, -9.55947369e-02, -4.13407143e-02,
        -1.21923614e-01,  4.74387979e-02,  9.94365524e-03,
        -2.40600513e-02,  2.15640296e-01, -1.74596705e-01,
        -1.94622359e-01, -1.65686339e-01, -1.64526645e-01,
        -1.82725760e-02,  2.07692421e-01,  4.59511234e-01,
        -4.36287710e-02, -2.62839700e-01, -1.24797945e-01,
         3.03612490e-01, -1.23968657e-01, -2.42453296e-01,
        -2.16491774e-01,  2.74404231e-02, -2.88219371e-01,
         3.04894510e-01, -4.75793005e-02, -1.56389386e-02,
         1.06891019e-01,  5.80275717e-02, -3.48138577e-02],
       [-2.25824364e-01, -2.43276323e-01,  2.00042897e-01,
        -5.02288750e-02,  5.84127444e-02,  9.49781275e-02,
         4.93019371e-02,  4.78210633e-02, -4.02815731e-02,
        -6.64961218e-02,  9.50484366e-03,  1.45041378e-01,
         5.16899645e-02, -4.74051485e-02, -1.66981378e-01,
         2.23871948e-01, -2.92316290e-02,  5.18354101e-02,
         2.78461881e-01,  1.82147244e-01, -6.69450572e-02,
         1.71143528e-02, -2.62634108e-01,  1.86841331e-01,
        -8.89276702e-02,  8.58434536e-02, -9.22271546e-02,
        -5.66547199e-02,  2.86782828e-01, -6.22799394e-01],
       [-1.26969087e-01, -8.45754237e-02, -2.21994626e-01,
        -7.33375274e-02, -2.49452427e-01,  2.90954779e-01,
        -2.81217879e-01, -6.79256050e-02, -3.78913285e-02,
         4.85740089e-01, -4.92283983e-01,  9.44215378e-02,
         1.47063602e-01,  2.70014595e-02,  6.76940539e-02,
        -1.77835000e-01,  1.34616097e-01, -5.48952633e-02,
         1.31945234e-01,  6.37828447e-02, -1.20019250e-01,
         1.14673736e-01, -1.30201705e-02, -6.94517217e-02,
        -2.23183886e-01,  6.64095010e-02,  3.66102472e-04,
        -1.16514820e-01, -3.63321846e-02,  2.62837496e-03],
       [-1.99076713e-01,  2.46626307e-01,  1.34481924e-01,
         7.84470318e-02, -1.81681359e-01, -3.77343810e-02,
         1.33954024e-01, -6.32259648e-02, -2.13025773e-01,
        -5.26149716e-02,  5.94463994e-02, -1.92305546e-03,
        -2.83401409e-01, -1.07375499e-01, -3.15138549e-01,
        -7.26859275e-02,  6.64558256e-02, -3.46076373e-01,
         2.42921779e-01,  4.23609675e-01, -1.90962208e-01,
        -9.39287059e-02, -1.38687182e-01, -1.99078367e-01,
        -2.09614633e-02,  1.12623921e-01,  1.00281967e-01,
         4.55634483e-02, -2.20208388e-01,  1.73952476e-01],
       [-1.88988595e-01, -9.78581444e-03,  3.39915610e-01,
        -6.39030441e-02,  3.83528771e-02,  1.52587557e-02,
         1.35043256e-02, -1.48484245e-01,  2.09773492e-01,
         1.42582693e-01, -1.78567540e-02, -6.80169563e-02,
        -8.46550924e-02, -9.84236016e-02,  3.84921845e-02,
        -4.76108269e-01, -6.01006316e-02,  2.90070340e-01,
         2.91297449e-01,  1.39686221e-01,  4.41018991e-01,
        -8.65904310e-02, -1.42087023e-02,  1.13984367e-02,
         1.83432368e-01, -8.50664558e-02, -1.11475338e-01,
        -1.22588454e-02,  1.69850399e-01,  1.71978376e-01],
       [-1.71947105e-01, -1.90453198e-01,  2.11900954e-01,
        -1.18373171e-01,  2.82492470e-01,  2.15197454e-01,
        -1.06383605e-01, -2.35931812e-01,  3.26388468e-02,
        -2.25839050e-01, -1.43098139e-01,  2.66711999e-01,
         9.32397601e-02, -1.43139500e-01, -1.02925290e-01,
         9.58393096e-02, -1.01494275e-01, -5.84955567e-02,
        -1.97560690e-01,  1.49531119e-01, -1.06840680e-01,
         2.80263242e-01,  3.91081117e-01, -1.73889019e-01,
         2.59542948e-01,  1.06566458e-01, -1.37641990e-01,
        -1.04624585e-01, -1.44886175e-01,  1.07947210e-01],
       [-1.53083550e-01, -5.91480454e-02,  2.67892625e-02,
         3.41798851e-01, -2.82166795e-01, -1.82789851e-01,
        -2.60406976e-01,  2.38619386e-01,  4.81488741e-02,
        -1.92080116e-01, -7.60675279e-02, -1.14310072e-01,
         2.04852437e-01,  4.69814662e-01, -3.36766748e-01,
         5.91316366e-02,  7.41217806e-02,  1.43896736e-01,
         9.21074234e-02,  7.56327784e-02,  1.79069605e-01,
         6.61167749e-02,  2.75134538e-01, -3.66322005e-02,
         8.15919260e-02,  1.33963865e-01, -1.29388103e-02,
         3.35060918e-02, -1.10076451e-02,  1.33872909e-02],
       [-2.14899992e-01,  9.59846529e-02, -1.89326552e-01,
        -1.03093220e-01, -1.22598498e-01, -2.31204750e-02,
         1.49973294e-02,  1.58171315e-01,  5.70577855e-04,
         7.75680110e-02, -8.06960615e-02,  5.93772075e-02,
        -6.29481275e-01,  1.96003027e-02,  1.08869079e-01,
         2.18228329e-01, -5.72110147e-02, -4.45417563e-02,
        -5.61937150e-02,  4.58434630e-02,  2.88618918e-01,
         4.34485704e-01,  1.81457499e-01,  4.16830260e-02,
         2.68495629e-02, -1.59884689e-01,  6.24146292e-04,
         1.41927066e-01,  1.08167755e-01, -9.88696472e-02],
       [-2.52539697e-01, -1.20180303e-01,  4.04953794e-02,
         3.75220846e-02,  1.11412102e-01,  1.27979092e-01,
        -3.04375322e-02,  8.15217111e-02, -4.08375030e-02,
        -3.84248627e-02, -8.95149263e-02,  1.63217436e-02,
         2.79134445e-03,  1.58180058e-01, -2.31349066e-01,
         4.53742373e-02,  1.17370360e-01,  1.17099833e-01,
        -3.51508959e-01, -1.32644395e-02, -2.67992486e-02,
        -2.70442033e-02, -4.33076122e-01, -2.92528925e-01,
         4.09622709e-03, -5.58483663e-01,  8.52581741e-02,
        -1.37985865e-02,  1.29141766e-01,  1.68752958e-01],
       [-7.41435166e-02,  4.19799116e-01,  2.47881319e-01,
         1.00055992e-01, -4.11912379e-02,  9.72157218e-02,
        -9.67990394e-02, -2.19995575e-02,  3.60837736e-01,
        -2.67984532e-03,  4.75541414e-02,  4.84810656e-02,
         1.23377815e-01,  2.06365299e-01,  3.83815279e-01,
         4.95039625e-03,  7.22957868e-02, -5.51080892e-02,
        -1.98934424e-01,  2.64510159e-01, -1.50739016e-01,
         1.53755809e-01, -1.28158206e-01,  9.15382727e-02,
         1.52251119e-01,  7.18597013e-02,  3.78841262e-01,
        -3.96300997e-02,  9.57860900e-02, -1.19422561e-01],
       [-1.44014401e-01,  4.53247089e-02, -8.54670595e-02,
        -2.25795894e-02,  5.25781307e-01, -1.47671044e-01,
        -4.52492559e-01, -2.77384163e-02,  6.48226633e-02,
         1.38899559e-01,  2.28500988e-01,  4.45788341e-02,
        -2.15804413e-01,  1.60998289e-01, -3.00874994e-02,
        -1.88083103e-01,  2.24669194e-01, -9.39300541e-02,
        -8.11214302e-03, -5.24787001e-02,  5.34114916e-02,
        -1.79716108e-01,  5.81899584e-02, -1.54881731e-01,
        -1.30280403e-01,  7.02563208e-02,  1.43764132e-02,
         4.71480711e-02, -2.12021747e-01, -2.91917932e-01],
       [-1.04377302e-01,  4.88605670e-01,  7.35385020e-02,
        -6.95850729e-02,  5.17880173e-02,  1.80559677e-01,
        -1.73245015e-02, -1.38533077e-01,  4.90661297e-02,
         9.92291113e-04, -1.44277837e-02, -1.23916460e-02,
         8.52186534e-02,  2.73273008e-01, -2.65472092e-02,
         2.37178871e-01, -1.47784639e-01, -1.11324486e-01,
         4.99609660e-02, -2.03637261e-01,  1.25340358e-01,
         9.78733850e-03, -2.10568423e-01,  2.49502847e-03,
        -1.22056071e-01,  8.04173808e-02, -6.08779037e-01,
        -1.57635096e-02, -5.44018111e-02,  1.00706845e-01],
       [-1.97547406e-01,  7.66694401e-02, -2.48797277e-01,
         8.50090798e-02,  2.78820605e-01, -3.20073310e-01,
        -1.24487941e-01, -1.71622083e-01, -1.08740575e-01,
         8.95255739e-02,  1.96592299e-02,  1.57653274e-01,
         1.06050841e-01, -2.86470060e-02, -9.66582175e-02,
         9.87794614e-02, -4.89274276e-02, -1.06394686e-01,
         1.95342805e-01, -1.14505050e-01, -9.24022821e-02,
         1.40164094e-01, -8.63018745e-02,  1.89813601e-01,
         1.11102861e-01,  1.36927034e-01,  1.74305806e-01,
        -3.93511261e-02,  4.92054802e-01,  3.77883342e-01],
       [-1.87807573e-01, -6.30089153e-02, -2.19908875e-01,
         9.42306866e-02,  3.00184652e-02,  2.84970455e-01,
         1.72337690e-01, -1.12378295e-01,  1.34430974e-01,
        -2.61526732e-01,  2.23109605e-01, -2.43286778e-01,
         1.25931183e-02, -7.42338173e-02,  1.53056298e-01,
        -9.93023917e-03,  6.06212538e-01,  1.33570539e-01,
         2.14194559e-01, -6.06081430e-02, -1.38494873e-01,
         2.07939176e-01,  5.25760945e-02, -5.66008515e-02,
        -7.68021967e-02,  4.47309493e-02, -1.29499547e-01,
         1.09623798e-01,  7.28184989e-02,  9.58528698e-02],
       [-2.48162724e-01, -4.81933495e-02,  1.54687355e-02,
        -1.46342267e-01,  8.11939337e-02, -1.99169501e-02,
         1.43751950e-01,  1.15548232e-01, -7.66315458e-02,
        -1.72406066e-01, -1.31704424e-01,  9.88241609e-03,
         9.40454962e-02, -4.50238531e-03,  2.13941758e-01,
         8.00079853e-02,  5.80141981e-02, -8.89810639e-02,
        -2.75597013e-01,  2.43131644e-01,  2.92855654e-01,
        -3.42618628e-01,  1.56607537e-01, -8.46591721e-03,
        -4.98031883e-01,  2.42695780e-01,  6.20428796e-02,
         3.00863997e-02,  2.10248927e-01,  1.71345750e-01],
       [-1.51700649e-01,  4.09977378e-01,  9.26018987e-02,
        -1.20875603e-01, -6.94984520e-02,  1.09243036e-01,
         1.26381205e-02, -3.21615901e-02, -1.33923856e-01,
         1.18809677e-01, -5.17951398e-02, -1.13881591e-01,
         1.91590783e-01, -2.69077039e-01, -1.96730677e-01,
         1.73915680e-01,  1.97310212e-01, -4.28632156e-02,
         3.74327449e-02, -2.57801318e-01,  7.66605349e-02,
        -2.56724090e-01,  4.01848841e-01, -6.65235806e-02,
         1.53877023e-01, -2.41432706e-01,  1.97714295e-01,
        -2.97583774e-02,  1.53415416e-01, -2.24042663e-01],
       [-1.99883241e-01,  7.36309967e-03,  2.81106297e-01,
         4.88706935e-02, -1.15106811e-01, -6.23985703e-02,
        -1.50694099e-02, -2.37617785e-01, -1.90420829e-01,
        -1.09737823e-01,  4.80395204e-03, -3.15340722e-01,
        -1.79397499e-01,  7.16075222e-02, -5.97017199e-02,
        -3.44650969e-01, -1.94036906e-01,  2.91393848e-02,
        -2.24606200e-01, -3.39942144e-01, -2.70092565e-01,
         2.13763496e-01,  9.88731207e-02,  7.85880195e-02,
        -3.20811708e-01,  7.89066180e-02,  5.88012912e-02,
        -1.13883750e-01,  1.45039069e-01, -1.07755461e-01],
       [-1.89726874e-01, -2.00645896e-01,  3.06362260e-01,
         1.03664238e-01,  9.08834969e-02,  1.50654024e-01,
        -9.34164118e-02,  2.21001982e-03,  1.06786888e-01,
         1.95216332e-01,  7.53816913e-02, -9.96635871e-02,
         4.66076155e-02, -2.86990250e-02,  4.78487007e-02,
         3.24094821e-01, -1.56560084e-01, -1.61245557e-02,
         2.10641626e-01, -1.45865227e-01, -3.18253309e-02,
        -2.28120909e-03,  4.75283305e-02,  2.67317414e-01,
        -2.25837964e-01, -1.38059827e-01,  2.48087043e-01,
         3.85639028e-01, -3.20463040e-01,  2.36271481e-01],
       [-1.10655848e-01, -2.35324818e-01,  8.70816049e-02,
        -1.07464259e-01, -2.58547361e-01, -3.11510401e-01,
         3.14344926e-01, -4.66469923e-01,  1.89002448e-01,
         2.15889006e-01,  2.11074606e-02,  2.37100185e-01,
        -8.66949152e-02,  3.20811798e-01,  6.43791899e-02,
         1.73642049e-01,  1.90601712e-01, -3.62391114e-02,
        -5.31077177e-02, -1.55352629e-01, -5.76460876e-02,
        -1.91408259e-01,  6.25639883e-02, -1.56064515e-01,
         9.42695254e-02,  3.73819277e-02, -9.25441486e-03,
         5.45048695e-02, -1.61140961e-02, -5.29235942e-02],
       [-2.13704817e-01,  5.47278840e-02, -2.17551050e-01,
         1.63965178e-01, -3.97549664e-02, -2.74472181e-01,
        -5.10365649e-02, -1.49643745e-01,  1.17554785e-01,
         7.64481701e-02, -1.09275170e-01, -1.94592735e-01,
         1.55697848e-01, -2.53974164e-01,  1.94547387e-02,
        -7.05850535e-02, -1.20606388e-01,  3.68064253e-02,
        -2.60894214e-01,  2.83672951e-01, -2.07850254e-01,
        -9.13229681e-02,  3.42378824e-02,  1.44770270e-01,
         7.76689692e-02, -1.54100635e-01, -3.39602982e-01,
         4.53464400e-01, -4.87552182e-03, -1.50864039e-01],
       [-1.04445510e-01, -1.03930485e-01, -3.51353987e-01,
        -4.65297430e-03, -1.26362081e-01,  3.97891284e-01,
        -1.32714686e-01, -4.04996467e-01, -1.76930155e-01,
        -4.17530796e-02,  2.32489559e-01, -2.11332509e-01,
        -6.18631772e-02,  1.19987288e-01, -4.77552393e-02,
         8.02145893e-02, -2.33852444e-01,  6.69549974e-02,
        -1.40461565e-01,  1.06560115e-01,  2.27571947e-01,
        -2.08869624e-01, -9.97974603e-02,  1.32310066e-01,
         2.32574955e-01,  1.54477416e-01,  2.05018180e-01,
        -4.10627189e-02, -4.96404985e-02, -5.40940323e-02],
       [-2.38927801e-01, -1.18252394e-02, -1.69451130e-03,
        -1.85613351e-01,  3.34573957e-02, -2.30558467e-02,
         2.44002789e-01,  2.34080103e-01, -6.64081409e-02,
         9.39645831e-02, -6.60376242e-02, -1.73742482e-02,
         2.05033545e-01, -7.37659864e-02, -1.26079792e-01,
        -1.95043407e-01,  9.41344431e-02, -8.96160809e-02,
        -1.93669312e-01, -2.77182941e-01,  2.03192740e-01,
         2.13718784e-01, -2.73290112e-01,  2.93250785e-03,
         2.54605909e-01,  4.05348423e-01,  1.20116917e-01,
         2.95004917e-01, -2.07038143e-01, -8.16965520e-02],
       [-9.26843088e-02, -1.43711414e-01,  1.02962326e-02,
         5.32817020e-01, -1.88515754e-01,  1.36900630e-01,
        -1.00193790e-01,  1.64064697e-02,  1.67420857e-01,
        -1.32579642e-01,  4.96911897e-02,  2.64892788e-01,
        -3.58782210e-02, -2.51912216e-01,  1.14511692e-01,
        -1.30576792e-01, -2.61789839e-02, -4.92368779e-01,
        -5.76320838e-02, -2.80122198e-01,  2.06965217e-01,
        -1.19696792e-01, -5.69479674e-02, -2.40639658e-02,
         2.77121618e-02, -4.37756691e-02, -8.46245379e-02,
        -4.64828950e-02,  8.81483650e-02, -5.78901592e-02]])

The percentage of variance contribution each principal component catches.

In [ ]:
df_eigval = pd.DataFrame({"Eigenvalues":eigenvalues}, index=range(1,31))
df_eigval["Explained proportion"] = df_eigval["Eigenvalues"] / np.sum(df_eigval["Eigenvalues"])
df_eigval["Explained proportion"]=df_eigval["Explained proportion"].map('{:.2%}'.format)

df_eigval
Out[ ]:
Eigenvalues Explained proportion
1 11.197874 37.33%
2 2.679118 8.93%
3 2.565002 8.55%
4 1.858621 6.20%
5 1.268067 4.23%
6 1.125324 3.75%
7 1.052669 3.51%
8 0.826507 2.76%
9 0.752391 2.51%
10 0.731164 2.44%
11 0.686631 2.29%
12 0.583688 1.95%
13 0.472453 1.57%
14 0.453675 1.51%
15 0.408879 1.36%
16 0.380853 1.27%
17 0.357907 1.19%
18 0.347102 1.16%
19 0.299762 1.00%
20 0.279588 0.93%
21 0.271779 0.91%
22 0.226782 0.76%
23 0.223101 0.74%
24 0.196697 0.66%
25 0.178388 0.59%
26 0.151631 0.51%
27 0.148246 0.49%
28 0.114703 0.38%
29 0.096510 0.32%
30 0.064887 0.22%

Now that you have calculated, presented and plotted tasks from c to f, you must explain each transformation thoroughly. Write a paragraph of 500 words at minimum that explains why returns are important, compare and contrast PCA and SVD, explain what the eigenvectors, eigenvalues, singular values etc show us for the specific data, etc.

Answer

Returns are a measure of the performance of our investments and allow us to compare performances from different investments. PCA is primarily used for dimensionality reduction, while SVD has broader applications, including matrix approximation and dimensionality reduction (Hair 92). Both PCA and SVD produce orthogonal vectors and the same eigenvalues. PCA involves computing eigenvectors and eigenvalues of the covariance matrix, while SVD directly deals with singular values. The singular values obtained from SVD can be used to calculate a variance of each principal component (Schwarz et al 2256). Each eigenvector transforms the standardized dataset into a principal component, whichcan explain a portion of the variance of the dataset. The corresponding eigenvalue for the principal component is the variance of the whole data explained by the principal component. The sum of all the eigenvalues is the total variance of the data.

Conclusion¶

Ensuring data quality in financial analysis is of paramount importance. Poor-quality data, whether structured or unstructured, hampers accurate modeling and reliable decision-making. For instance, in the context of this assignment, structured data included currency symbols, placeholders, and improper classifications that hinder numerical operations. Similarly, unstructured data exhibited missing values and formatting issues that complicated assessment.

The comparative analysis of the Nelson-Siegel and Cubic-Spline models highlighted their respective strengths and weaknesses. The Nelson-Siegel model, while easier to interpret, struggled with certain maturities, whereas the Cubic-Spline model offered a more precise fit but lacked clear economic interpretation. Ethical considerations in data smoothing were addressed, emphasizing transparency and accuracy over potential misrepresentation.

Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) both proved essential in reducing dimensionality and understanding underlying data structures, offering insights into returns and variability in investment performance. The screeplots revealed the differences in explained variance between uncorrelated and government data, emphasizing the significance of correlation in principal components.

References¶

Akinyemi, Kemi, et al. ''Yield curve extrapolation methods: Methodologies for valuing cash flows that extend beyond the maximum yield curve''. Society of actuaries, March (2019), https://www.soa.org/resources/research-reports/2019/yield-curve-report/

Copeland, Ronald M. "Income smoothing." Journal of accounting research, (1968): 101-116, https://doi.org/10.2307/2490073

Hair, J. F., et al. Multivariate Data Analysis. 7th Edn Prentice Hall International. Prentice Hall, (2009).

Huh, Y. U., et al. ''Data quality''. Information and software technology, 32.8 (1990): 559-565., https://www.sciencedirect.com/science/article/pii/095058499090146I

Liu, Grace. "Data quality problems troubling business and financial researchers: A literature review and synthetic analysis." Journal of Business & Finance Librarianship, 25.3-4 (2020): 315-371, https://doi.org/10.1080/08963568.2020.1847555

Nadinić, Berislav, and Damir Kalpić. "Data quality in finances and its impact on credit risk management and CRM integration." International Conference on Software and Data Technologies, Special Session on Applications in Banking and Financing, (3; 2008). 2008, https://doi.org/10.5220/0001879103270331

Nymand-Andersen, Per. ''Yield curve modelling and a conceptual framework for estimating yield curves: evidence from the European Central Bank's yield curves''. ECB Statistics Paper, 27, 2018, https://doi.org/10.2866/892636

Praveen, Shagufta, and Umesh Chandra. "Influence of structured, semi-structured, unstructured data on various data models." International Journal of Scientific & Engineering Research, 8.12 (2017): 67-69, https://www.researchgate.net/publication/344363081

Schwarz, Christian, et al. "Principal component analysis and singular value decomposition used for a numerical sensitivity analysis of a complex drawn part." The International Journal of Advanced Manufacturing Technology, 94 (2018): 2255-2265. https://link.springer.com/article/10.1007/s00170-017-0980-z

Sebastian-Coleman, Laura. Measuring data quality for ongoing improvement: a data quality assessment framework. Newnes, 2012, https://www.sciencedirect.com/book/9780123970336/measuring-data-quality-for-ongoing-improvement

Strong, Diane M., et al. "Data quality in context." Communications of the ACM, 40.5 (1997): 103-110, https://doi.org/10.1145/253769.253804

Wahlstrøm, Ranik Raaen, et al. ''A comparative analysis of parsimonious yield curve models with focus on the Nelson-Siegel, Svensson and Bliss versions''. Computational Economics (2022): 1-38, https://doi.org/10.1007/s10614-021-10113-w

Wang, Richard Y., and Diane M. Strong. "Beyond accuracy: What data quality means to data consumers." Journal of management information systems, 12.4 (1996): 5-33, https://doi.org/10.1080/07421222.1996.11518099.

Useful links

Data Society

Atharva Arya

.

In [ ]:
%%capture
!wget -nc https://raw.githubusercontent.com/brpy/colab-pdf/master/colab_pdf.py
from colab_pdf import colab_pdf
colab_pdf('Group_Work_Project 1, group 8072.ipynb')
#set_matplotlib_formats('pdf', 'svg')