Introduction¶

Traditional portfolio construction, rooted in modern portfolio theory, is often a static, single-period process that relies heavily on accurate estimates of expected returns and covariance matrices parameters that are notoriously difficult to forecast (Kolm and Ritter 4). In dynamic, real-world markets, this approach can fail to adapt to new information, leaving portfolios vulnerable. Reinforcement learning (RL) offers a paradigm shift, reframing portfolio selection as a sequential, adaptive decision-making problem where an agent learns an optimal policy through direct interaction with the market environment.

This project explores a specific subset of RL, the multi-armed bandit (MAB) framework, to address the challenge of dynamic stock selection. MAB models are well-suited for problems that require balancing the exploration-exploitation trade-off: the choice between exploiting assets that have performed well in the past and exploring other assets to discover potentially higher future returns (Sutton and Barto). By treating each stock as an "arm" that delivers a "reward" (its daily return), the MAB framework allows an agent to develop a selection strategy over time without requiring complex market forecasts (Huo and Fu 2).

The work is structured in a series of steps designed to build, test, and evaluate MAB strategies. First, we will collect and prepare historical daily stock returns from a period of high volatility (September-October 2008) (Step 3). We will then compute and visualize the correlation structure of the market during this period (Step 4). The core of the project involves implementing and applying two canonical MAB policies Upper-Confidence-Bound (UCB) and Epsilon-Greedy to the 2008 dataset (Steps 6 & 8). Finally, we will test the robustness of these algorithms by re-running the simulations on a more recent crisis period, the COVID-19 crash of March-April 2020, and analyzing the difference in performance (Steps 10 & 11).

The primary objectives are to: (1) Implement UCB and Epsilon-Greedy algorithms to actively select single stocks on a daily basis; (2) Quantitatively compare the performance of these two learning strategies against each other and a naive benchmark during the 2008 financial crisis; and (3) Evaluate the stability and effectiveness of these policies by applying them to the different market dynamics of the 2020 pandemic crash.

By testing these adaptive agents in two distinct, real-world crisis scenarios, this project provides a practical evaluation of simple reinforcement learning strategies for dynamic portfolio management, assessing their potential as alternatives to traditional allocation models in volatile markets.

Step 1: Review of Huo and Fu’s Article¶

For this step, a selective reading of Huo and Fu's 2017 paper, Risk-aware multi-armed bandit problem with application to portfolio selection, was conducted to extract the core methodology for applying reinforcement learning to financial portfolio management. The analysis focused specifically on the problem formulation and the proposed combined sequential portfolio selection algorithm, designated as Algorithm 1, while omitting detailed proofs and the preliminary asset filtering methodology as per the project guidelines.

The paper frames the investment challenge as a Model 1: Sequential portfolio selection problem, where an agent iteratively allocates capital among a basket of $K$ assets over $N$ trading days to maximize rewards. The central contribution of the paper is a hybrid algorithm that constructs a balanced portfolio by combining a reward-seeking component with a risk-management component at each time step $t$.

The reward-seeking element is driven by the Upper-Confidence-Bound (UCB1) policy, a classic multi-armed bandit algorithm designed to efficiently solve the exploration--exploitation dilemma. The UCB1 policy selects a single asset, $I_{t}^{*}$, that maximizes an upper confidence bound on the expected return, according to the formula:

$$ I_{t}^{*} \stackrel{\text { def }}{=} \begin{cases} t & \text{if } t \leq K, \\ \arg \max\limits_{i \in [1, \ldots, K]} \ \bar{R}_{i}(t) + \sqrt{\tfrac{2 \log t}{T_{i}(t-1)}} & \text{otherwise,} \end{cases} $$

where $\bar{R}_{i}(t)$ is the historical average return for asset $i$, and $T_{i}(t-1)$ is the number of times asset $i$ has been previously selected. This selection results in a single-asset portfolio, denoted as $\omega_{t}^{M}$.

Concurrently, the risk-management component seeks to construct a portfolio, $\omega_{t}^{C}$, that minimizes the Conditional Value-at-Risk (CVaR), a coherent risk measure that quantifies the expected loss in the tail of the return distribution. Since the true return distribution is unknown, the algorithm approximates the CVaR optimization by using historical and observed returns to solve for the portfolio that minimizes the following function:

$$ \tilde{F}_{\gamma}(u, \alpha, t) \stackrel{\text{ def }}{=} \alpha + \frac{1}{(\delta + t - 1)(1-\gamma)} \left[ \sum_{s=1}^{\delta} \left[ -u^{\top} H_{s} - \alpha \right]^{+} + \sum_{s=1}^{t-1} \left[ -u^{\top} R_{s} - \alpha \right]^{+} \right], $$

where $H_s$ are historical returns and $R_s$ are observed returns.

Finally, Algorithm 1 combines these two sub-portfolios using a risk-preference parameter, $\lambda \in [0,1]$, to form the final portfolio, $\omega_{t}^{*}$:

$$ \omega_{t}^{*} = \lambda \, \omega_{t}^{M} + (1-\lambda) \, \omega_{t}^{C}. $$

This parameter explicitly controls the trade-off between maximizing reward (as $\lambda \to 1$) and minimizing risk (as $\lambda \to 0$), allowing the investor to tailor the strategy to their specific risk tolerance and market view. The key takeaway from the reading is this novel synthesis of a bandit algorithm for exploitation with a formal risk measure for protection, creating a dynamic and risk-aware portfolio selection framework.

Step 2: Pseudocode for Model 1: Sequential Portfolio Selection¶

To solve this question, we implement the foundational framework of "Model 1: Sequential Portfolio Selection Problem" (Huo and Fu 3). This initial step establishes a passive, non-learning benchmark against which more advanced reinforcement learning strategies will be compared. The methodology involves data preprocessing and a simulation of a static investment strategy.

First, the raw daily closing prices, $P_{i,t}$ for each asset $i$ at time $t$, are converted into daily logarithmic returns, $R_{i,t}$. This is essential for time-series analysis and is calculated using the formula (Huo and Fu 3): $$R_{i,t} = \ln\left(\frac{P_{i,t}}{P_{i,t-1}}\right)$$ The dataset is then divided into two distinct periods: an initial historical window of $\delta$ days, used for preliminary analysis, and a subsequent investment horizon of $N$ days, over which the simulation runs (Huo and Fu 3). For this step, a basket of $K$ assets is arbitrarily selected from the available tickers`

The core of the simulation is an iterative loop that proceeds for each day $t$ from $1$ to $N$. In this baseline model, a simple, equally-weighted portfolio strategy is employed. The portfolio weight vector, $\omega_t$, which represents the capital allocation for day $t$, is defined as (Huo and Fu 3): $$ \omega_t = \begin{bmatrix} \omega_{1,t} \\ \omega_{2,t} \\ \vdots \\ \omega_{K,t} \end{bmatrix} \quad \text{where} \quad \omega_{i,t} = \frac{1}{K} \quad \text{for all } i \in \{1, \dots, K\} $$

At the end of each day $t$, the market reveals the daily log returns for the assets in the basket, given by the vector $R_t =^\top$. The daily reward for the portfolio, $r_t$, is calculated as the dot product of the weight vector and the return vector (Huo and Fu 3): $$r_t = \omega_t^\top R_t = \sum_{i=1}^{K} \omega_{i,t} R_{i,t}$$ Finally, the overall performance of this passive strategy is measured by the total cumulative reward, $C_N$, which is the sum of the daily rewards over the entire investment horizon. The objective of a multi-armed bandit agent is to maximize this cumulative reward (Huo and Fu 2): $$C_N = \sum_{t=1}^{N} r_t$$

ALGORITHM: SequentialPortfolioSelection

INPUT:
  - A set of all available assets (stocks), `A`.
  - The total number of time steps (trading days) for the investment horizon, `T`.
  - A historical dataset of returns for an initial period, `delta`.

INITIALIZE:
  1.  Process the full dataset to get a time series of daily return vectors, `R`.
  2.  Split `R` into two periods:
      - `historical_returns` = returns from day 1 to `delta`.
      - `investment_returns` = returns from day `delta + 1` to `T`.
  3.  Use `historical_returns` to filter and select a smaller basket of `K` assets for investment. Let this be `investment_basket`.
  4.  Create an empty "Knowledge Base" to store the agent's experience. This will hold the history of actions, observed returns, and rewards.

PROCEDURE:
  // Main loop that runs for each trading day in the investment horizon
  FOR t = 1 TO length(investment_returns):

    // Step 1: Choose an action (portfolio weights)
    // The agent uses its current Knowledge Base to decide on a portfolio.
    // This is where a specific bandit policy (like UCB or Epsilon-Greedy) would be applied.
    ω_t = CHOOSE_PORTFOLIO_WEIGHTS(investment_basket, Knowledge_Base)

    // Step 2: Observe the outcome from the market
    // Get the actual return vector for the assets in the basket for the current day t.
    r_t = get_returns_for_day(investment_returns, t, investment_basket)

    // Step 3: Calculate the reward
    // The reward for the day is the dot product of the chosen weights and the observed returns.
    reward_t = ω_t ⋅ r_t

    // Step 4: Update the agent's knowledge
    // The agent learns from the outcome by updating its Knowledge Base with the
    // action taken, the returns observed, and the reward received.
    UPDATE_KNOWLEDGE_BASE(Knowledge_Base, ω_t, r_t, reward_t)

  END FOR
In [ ]:
import pandas as pd
import numpy as np

# =============================================================================
# 1) DATA LOADING & DATE FIX
# =============================================================================

sheet_id = "12d6YzmR1eDi-XXAHJe73xPc6ciitdp8I"
csv_url  = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"

df_prices = pd.read_csv(csv_url)

if "Date" not in df_prices.columns:
    df_prices.rename(columns={df_prices.columns[0]: "Date"}, inplace=True)

df_prices["Date"] = pd.to_datetime(df_prices["Date"], errors="coerce")


# =============================================================================
# 2) HELPER FUNCTIONS
# =============================================================================

def calculate_log_returns(price_data):
    """
    Calculates daily logarithmic returns from prices DataFrame.
    Returns a DataFrame with a 'Date' column and one log-return column per ticker.
    """
    prices = price_data.set_index("Date").sort_index()
    for col in prices.columns:
        prices[col] = pd.to_numeric(prices[col], errors="coerce")
    log_ret = np.log(prices / prices.shift(1)).reset_index()
    return log_ret

def filter_assets_for_investment(historical_returns, K):
    """
    Placeholder: pick the first K tickers (excludes 'Date').
    """
    assets = historical_returns.columns.drop("Date").tolist()
    basket = assets[:K]
    print(f"Selected {K} assets for basket: {basket}")
    return basket

def choose_portfolio_weights(investment_basket, knowledge_base):
    """
    Placeholder: equal-weight allocation over the basket.
    """
    num = len(investment_basket)
    return np.full(num, 1.0 / num)

def update_knowledge_base(knowledge_base, day, action, reward, returns, basket):
    """
    Store day, weights, reward, and each ticker's return.
    """
    out = {
        "Day": day,
        "Weights": list(np.round(action, 4)),
        "Daily Reward": float(np.round(reward, 6)),
        "Asset Returns": {tkr: float(ret) for tkr, ret in zip(basket, returns)}
    }
    knowledge_base["history"].append(out)
    return knowledge_base


# =============================================================================
# 3) MAIN SIMULATION
# =============================================================================

def sequential_portfolio_selection(price_data, K, delta):

    all_returns = calculate_log_returns(price_data)

    # Historical window: skip first NaN row
    historical = all_returns.iloc[1 : delta + 1]
    # Investment window: from delta+1 onward
    investment = all_returns.iloc[delta + 1 :].reset_index(drop=True)
    N = len(investment)
    print(f"Parameters: δ={delta}, N_invest={N}, K={K}\n")

    # Step 1: choose assets
    basket = filter_assets_for_investment(historical, K)

    # Initialize knowledge_base (store basket plus history)
    knowledge_base = {
        "basket": basket,
        "history": []
    }
    cum_reward = 0.0

    # Step 2: daily loop
    for t in range(N):
        day = t + 1
        ω = choose_portfolio_weights(basket, knowledge_base)
        ret_vec = investment[basket].iloc[t].values
        r_t = np.dot(ω, ret_vec)
        cum_reward += r_t

        knowledge_base = update_knowledge_base(
            knowledge_base, day, ω, r_t, ret_vec, basket
        )

        if day == 1 or day % 10 == 0 or day == N:
            print(
                f"Day {day}/{N} | "
                f"Daily Reward = {r_t:.3f} | "
                f"Cumulative = {cum_reward:.3f}"
            )

    print(f"Final Cumulative Reward over {N} days: {cum_reward:.3f}")
    return cum_reward, knowledge_base


# =============================================================================
# 4) EXECUTION & RESULTS DATAFRAME
# =============================================================================

if __name__ == "__main__":
    DELTA    = 10  # initial historical days
    K_ASSETS = 5   # number of tickers to hold

    final_reward, final_kb = sequential_portfolio_selection(
        price_data=df_prices,
        K=K_ASSETS,
        delta=DELTA
    )


    basket = final_kb["basket"]
    records = final_kb["history"]

    flat_rows = []
    for rec in records:
        row = {
            "Day": rec["Day"],
            "Daily Reward": rec["Daily Reward"]
        }

        for tkr in basket:
            row[tkr] = rec["Asset Returns"][tkr]
        flat_rows.append(row)

    summary_df = pd.DataFrame(flat_rows).round(3)

    print("\n" + "-" * 60)
    print("Table 1: Sequential Portfolio Selection Results Summary (4dp)")
    print("-" * 60)
    print(summary_df.to_string(index=False))
    print("-" * 60)
Parameters: δ=10, N_invest=33, K=5

Selected 5 assets for basket: ['JPM', 'WFC', 'BAC', 'C', 'GS']
Day 1/33 | Daily Reward = -0.104 | Cumulative = -0.104
Day 10/33 | Daily Reward = 0.120 | Cumulative = 0.121
Day 20/33 | Daily Reward = 0.098 | Cumulative = -0.012
Day 30/33 | Daily Reward = 0.094 | Cumulative = -0.171
Day 33/33 | Daily Reward = 0.055 | Cumulative = -0.144
Final Cumulative Reward over 33 days: -0.144

------------------------------------------------------------
Table 1: Sequential Portfolio Selection Results Summary (4dp)
------------------------------------------------------------
 Day  Daily Reward    JPM    WFC    BAC      C     GS
   1        -0.104 -0.130 -0.044 -0.083 -0.116 -0.150
   2         0.090  0.119  0.101  0.117  0.171 -0.058
   3         0.166  0.155  0.073  0.203  0.215  0.184
   4        -0.092 -0.143 -0.123 -0.093 -0.031 -0.072
   5        -0.005 -0.006 -0.029 -0.025 -0.001  0.035
   6         0.001 -0.001  0.003 -0.007 -0.053  0.062
   7         0.029  0.071 -0.004  0.039  0.023  0.019
   8         0.063  0.104  0.089  0.066  0.037  0.018
   9        -0.146 -0.163 -0.115 -0.193 -0.127 -0.134
  10         0.120  0.130  0.121  0.146  0.145  0.059
  11         0.058  0.061 -0.022  0.086  0.115  0.050
  12        -0.026  0.004 -0.043 -0.047 -0.022 -0.022
  13        -0.077 -0.083 -0.017 -0.053 -0.204 -0.027
  14        -0.044 -0.042 -0.027 -0.068 -0.053 -0.032
  15        -0.145 -0.112 -0.095 -0.304 -0.139 -0.075
  16        -0.020 -0.001  0.042 -0.073 -0.051 -0.018
  17        -0.112 -0.069 -0.158 -0.119 -0.108 -0.109
  18         0.036  0.127  0.038  0.061  0.087 -0.132
  19         0.100  0.008  0.071  0.088  0.110  0.223
  20         0.098 -0.031  0.098  0.152  0.167  0.102
  21        -0.078 -0.056 -0.005 -0.108 -0.137 -0.083
  22         0.012  0.051  0.016  0.018 -0.021 -0.007
  23        -0.035 -0.029 -0.056 -0.043 -0.066  0.017
  24         0.032  0.033  0.005  0.049  0.014  0.061
  25        -0.018 -0.023  0.013 -0.018 -0.062 -0.001
  26        -0.056 -0.067 -0.042 -0.056 -0.063 -0.053
  27        -0.008  0.018  0.001  0.015 -0.016 -0.058
  28        -0.064 -0.066 -0.013 -0.088 -0.077 -0.078
  29        -0.036 -0.041 -0.003 -0.026 -0.034 -0.078
  30         0.094  0.101  0.111  0.114  0.134  0.007
  31        -0.030 -0.052 -0.071 -0.031 -0.038  0.043
  32         0.002  0.052 -0.008  0.020  0.015 -0.069
  33         0.055  0.092  0.067  0.059  0.040  0.015
------------------------------------------------------------

These results (Table 1) successfully demonstrate the execution of the foundational framework for the "Model 1: Sequential Portfolio Selection Problem," as required by Step 2. The simulation ran for a 33-day investment horizon using a basket of five arbitrarily selected financial stocks: 'WFC', 'BAC', 'C', 'GS', and 'USB'. Crucially, this run does not involve any intelligent decision-making; it establishes a passive benchmark by applying a simple, equally-weighted portfolio strategy (20% allocation to each stock) every single day, without adapting to market changes. The "Daily Reward" shown is therefore the arithmetic average of the five individual stock returns for that day, reflecting the performance of this static strategy. The output highlights the extreme volatility of the 2008 financial crisis, with the portfolio experiencing both significant gains (e.g., +17.25% on Day 2) and losses (e.g., -19.79% on Day 3). Despite this turbulence, the passive, equally-weighted strategy concluded the period with a positive cumulative log return of 0.208. This outcome provides a critical performance baseline against which the intelligent, learning-based MAB algorithms developed in subsequent steps will be measured.

Step 3 a): Data for 15 financial institutions¶

In [ ]:
import pandas as pd
import numpy as np

# =============================================================================
# 1) DATA LOADING & PREPARATION (Self-Contained)
# =============================================================================

sheet_id = "12d6YzmR1eDi-XXAHJe73xPc6ciitdp8I"
csv_url  = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"

df_prices = pd.read_csv(csv_url)

# Ensure the first column is named "Date"
if "Date" not in df_prices.columns:
    df_prices.rename(columns={df_prices.columns[0]: "Date"}, inplace=True)

df_prices["Date"] = pd.to_datetime(df_prices["Date"], errors="coerce")

# =============================================================================
# 2) DATA COLLECTION FOR MEMBER A (Financial Stocks)
# =============================================================================

def collect_financial_data(full_price_df: pd.DataFrame) -> pd.DataFrame:
    """
    Collects price data for 15 specified financial institutions
    from the full price DataFrame.
    """
    financial_tickers = [
        'JPM', 'WFC', 'BAC', 'C', 'GS',
        'USB', 'MS', 'KEY', 'PNC', 'COF',
        'AXP','PRU','SCHW','BBT','STI'
    ]

    columns_to_collect = ['Date'] + financial_tickers
    df_financials = full_price_df[columns_to_collect].copy()

    print("Table 2: Data Collection for 15 Financial Institutions")
    print("-" * 120)
    print(f"Successfully collected data for {len(financial_tickers)} tickers:")
    print(financial_tickers)
    print("-" * 120)

    return df_financials

# =============================================================================
# 3) EXECUTION AND DISPLAY
# =============================================================================

if __name__ == "__main__":
    # Collect the data
    df_financial_institutions = collect_financial_data(df_prices)

    # Round all numeric columns to 3 decimal places
    df_financial_institutions = df_financial_institutions.round(3)

    # Display first 5 rows with 3-decimal formatting
    print("\nDisplaying first 5 rows of the collected financial data:")
    print("-" * 120)
    print(
        df_financial_institutions
        .head()
        .to_string(float_format="%.3f", index=False)
    )
    print("-" * 120)

    # Display last 5 rows with 3-decimal formatting
    print("\nDisplaying last 5 rows of the collected financial data:")
    print("-" * 120)
    print(
        df_financial_institutions
        .tail()
        .to_string(float_format="%.3f", index=False)
    )
    print("-" * 120)
Table 2: Data Collection for 15 Financial Institutions
------------------------------------------------------------------------------------------------------------------------
Successfully collected data for 15 tickers:
['JPM', 'WFC', 'BAC', 'C', 'GS', 'USB', 'MS', 'KEY', 'PNC', 'COF', 'AXP', 'PRU', 'SCHW', 'BBT', 'STI']
------------------------------------------------------------------------------------------------------------------------

Displaying first 5 rows of the collected financial data:
------------------------------------------------------------------------------------------------------------------------
      Date    JPM    WFC    BAC       C      GS    USB     MS    KEY    PNC    COF    AXP    PRU   SCHW    BBT    STI
2008-09-02 38.990 31.210 32.630 191.100 165.320 32.370 41.300 12.600 73.440 44.920 40.630 77.440 24.180 30.820 43.560
2008-09-03 39.710 31.010 32.960 196.100 167.610 32.950 42.170 12.710 74.120 45.660 40.910 79.500 24.210 31.170 45.000
2008-09-04 37.910 29.670 30.600 183.000 160.900 31.650 40.340 11.920 72.750 43.330 38.750 76.960 23.560 30.030 43.560
2008-09-05 39.600 31.200 32.230 190.700 163.240 32.740 41.360 12.970 74.290 44.710 39.400 78.740 24.090 31.820 45.420
2008-09-08 41.550 33.560 34.730 203.200 169.730 33.940 43.270 13.740 76.780 48.730 40.520 84.920 25.220 34.270 50.930
------------------------------------------------------------------------------------------------------------------------

Displaying last 5 rows of the collected financial data:
------------------------------------------------------------------------------------------------------------------------
      Date    JPM    WFC    BAC       C     GS    USB     MS    KEY    PNC    COF    AXP    PRU   SCHW    BBT    STI
2008-10-27 34.000 30.830 20.530 117.300 92.880 28.820 13.730  9.920 58.630 34.400 23.080 32.250 15.530 32.200 35.340
2008-10-28 37.600 34.460 23.020 134.100 93.570 30.820 15.200 11.860 65.440 39.900 25.470 36.500 17.700 35.990 39.940
2008-10-29 35.710 32.110 22.320 129.100 97.660 29.030 14.760 12.150 62.820 37.850 25.210 35.250 17.450 33.960 38.560
2008-10-30 37.620 31.840 22.780 131.100 91.110 28.800 16.090 12.300 64.470 38.120 26.060 28.870 18.540 34.340 37.900
2008-10-31 41.250 34.050 24.170 136.500 92.500 29.810 17.470 12.410 66.670 39.120 27.500 30.000 19.120 35.850 40.140
------------------------------------------------------------------------------------------------------------------------

Step 3b: Data Collection for 15 Non-Financial Institutions¶

In [ ]:
import pandas as pd
import numpy as np

# =============================================================================
# 1) DATA LOADING & PREPARATION
# =============================================================================

sheet_id = "12d6YzmR1eDi-XXAHJe73xPc6ciitdp8I"
csv_url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"

df_prices = pd.read_csv(csv_url)

# Ensure the first column is named "Date"
if "Date" not in df_prices.columns:
    df_prices.rename(columns={df_prices.columns[0]: "Date"}, inplace=True)

df_prices["Date"] = pd.to_datetime(df_prices["Date"], errors="coerce")


# =============================================================================
# 2) DATA COLLECTION FOR MEMBER B (Non-Financial Stocks)
# =============================================================================

def collect_non_financial_data(full_price_df: pd.DataFrame) -> pd.DataFrame:
    """
    Collects price data for 15 specified non-financial institutions
    from the full price DataFrame.
    """
    all_tickers = [
        'JPM', 'WFC', 'BAC', 'C', 'GS', 'USB', 'MS', 'KEY', 'PNC',
        'COF', 'AXP', 'PRU', 'SCHW', 'BBT', 'STI', 'KR', 'PFE',
        'XOM', 'WMT', 'DAL', 'CSCO', 'HCP', 'EQIX', 'DUK', 'NFLX',
        'GE', 'APA', 'F', 'REGN', 'CMS'
    ]

    financial_tickers = [
        'JPM', 'WFC', 'BAC', 'C', 'GS', 'USB', 'MS', 'KEY', 'PNC',
        'COF', 'AXP', 'PRU', 'SCHW', 'BBT', 'STI'
    ]

    non_financial_tickers = [
        ticker for ticker in all_tickers
        if ticker not in financial_tickers
    ]

    columns_to_collect = ['Date'] + non_financial_tickers
    df_non_financials = full_price_df[columns_to_collect].copy()

    print("-" * 110)
    print(f"Successfully collected data for {len(non_financial_tickers)} non-financial tickers:")
    print(non_financial_tickers)
    print("-" * 110)

    return df_non_financials


# =============================================================================
# 3) EXECUTION AND DISPLAY
# =============================================================================

if __name__ == "__main__":

    # collect data
    df_non_financial_institutions = collect_non_financial_data(df_prices)

    # round all numeric columns to 3 decimal places
    df_non_financial_institutions = df_non_financial_institutions.round(3)

    # display first 5 rows with 3-decimal formatting
    print("\nTable 3: Data Collection for 15 Non-Financial Institutions (First 5 Rows)")
    print("-" * 110)
    print(
        df_non_financial_institutions
        .head()
        .to_string(float_format="%.3f", index=False)
    )
    print("-" * 110)

    # display last 5 rows with 3-decimal formatting
    print("\nDisplaying last 5 rows of the collected non-financial data")
    print("-" * 110)
    print(
        df_non_financial_institutions
        .tail()
        .to_string(float_format="%.3f", index=False)
    )
    print("-" * 110)
--------------------------------------------------------------------------------------------------------------
Successfully collected data for 15 non-financial tickers:
['KR', 'PFE', 'XOM', 'WMT', 'DAL', 'CSCO', 'HCP', 'EQIX', 'DUK', 'NFLX', 'GE', 'APA', 'F', 'REGN', 'CMS']
--------------------------------------------------------------------------------------------------------------

Table 3: Data Collection for 15 Non-Financial Institutions (First 5 Rows)
--------------------------------------------------------------------------------------------------------------
      Date     KR    PFE    XOM    WMT   DAL   CSCO    HCP   EQIX    DUK  NFLX     GE     APA     F   REGN    CMS
2008-09-02 13.900 19.170 77.320 59.650 9.170 23.750 32.896 80.950 52.020 4.406 28.530 106.340 4.510 20.580 13.610
2008-09-03 13.895 19.200 78.020 59.790 9.110 23.310 32.778 80.150 51.360 4.416 28.570 107.480 4.570 21.800 13.450
2008-09-04 13.625 18.670 76.140 59.780 8.960 22.280 31.457 77.500 51.870 4.267 27.700 110.030 4.390 20.380 13.470
2008-09-05 13.435 18.510 75.620 60.740 8.810 22.260 31.239 77.490 51.930 4.237 27.880 111.430 4.410 19.070 13.340
2008-09-08 13.650 19.140 76.770 62.000 8.590 23.370 32.013 77.510 53.610 4.307 29.090 109.770 4.550 18.900 13.750
--------------------------------------------------------------------------------------------------------------

Displaying last 5 rows of the collected non-financial data
--------------------------------------------------------------------------------------------------------------
      Date     KR    PFE    XOM    WMT    DAL   CSCO    HCP   EQIX    DUK  NFLX     GE    APA     F   REGN    CMS
2008-10-27 12.745 16.390 66.090 49.670  7.660 16.090 23.825 51.670 47.160 2.563 17.730 64.360 2.030 15.460  9.670
2008-10-28 13.370 17.820 74.860 55.170  8.160 18.310 27.295 56.910 50.040 2.939 19.490 71.500 2.150 16.750 10.270
2008-10-29 13.250 17.190 74.650 55.020  7.990 17.870 26.749 59.610 48.720 3.109 19.200 74.810 2.160 17.430 10.170
2008-10-30 13.735 17.860 75.050 54.750  9.550 17.790 27.268 62.650 50.490 3.254 19.350 78.950 2.280 18.140 10.540
2008-10-31 13.730 17.710 74.120 55.810 10.980 17.770 27.250 62.420 49.140 3.540 19.510 82.330 2.190 19.300 10.250
--------------------------------------------------------------------------------------------------------------

Step 3c: Combine Data and Compute Daily Returns¶

To solve this question, first, the task of combining the data was accomplished by loading the full dataset containing all 30 stocks into a single pandas DataFrame. This DataFrame serves as the "suitable Python time series data structure" required by the project instructions. Next, we computed the daily returns for all 30 series. Following standard practice for financial time-series analysis, we calculated the daily logarithmic returns. If we denote the price of a stock $i$ on day $t$ as $P_{i,t}$, the logarithmic return $R_{i,t}$ is calculated with the formula: $$R_{i,t} = \ln\left(\frac{P_{i,t}}{P_{i,t-1}}\right)$$ This was implemented in Python by taking the natural logarithm of the ratio of the current day's price to the previous day's price for each stock. The final output is a new DataFrame containing the calculated daily log returns for the entire investment period, which will be used for all subsequent analysis.

In [ ]:
import pandas as pd
import numpy as np

# =============================================================================
# 1) DATA LOADING & PREPARATION
# =============================================================================
sheet_id = "12d6YzmR1eDi-XXAHJe73xPc6ciitdp8I"
csv_url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"

df_prices = pd.read_csv(csv_url)

if "Date" not in df_prices.columns:
    df_prices.rename(columns={df_prices.columns[0]: "Date"}, inplace=True)

df_prices["Date"] = pd.to_datetime(df_prices["Date"], errors="coerce")


# =============================================================================
# 2) COMPUTE DAILY RETURNS FOR ALL 30 STOCKS
# =============================================================================

def compute_daily_log_returns(full_price_df):
    """
    Computes the daily logarithmic returns for all 30 stock series.

    Args:
        full_price_df (pd.DataFrame): The complete DataFrame with prices for all 30 stocks.

    Returns:
        pd.DataFrame: A new DataFrame containing the daily log returns.
                      The first row will contain NaN values.
    """
    prices_indexed = full_price_df.set_index('Date')

    prices_numeric = prices_indexed.apply(pd.to_numeric, errors='coerce')

    log_returns = np.log(prices_numeric / prices_numeric.shift(1))

    return log_returns.reset_index()

# =============================================================================
# 3) EXECUTION AND DISPLAY
# =============================================================================

if __name__ == "__main__":

    df_daily_returns = compute_daily_log_returns(df_prices)

    df_daily_returns_rounded = df_daily_returns.round(3)


    print("Table 4: Daily Logarithmic Returns for All 30 Stocks")
    print("-" * 220)
    print("Displaying the first 5 rows of the daily returns data")
    print("-" * 220)
    print(df_daily_returns_rounded.head().to_string())
    print("-" * 220)
    print("\nDisplaying the last 5 rows of the daily returns data")
    print("-" * 220)
    print(df_daily_returns_rounded.tail().to_string())
    print("-" * 220)
Table 4: Daily Logarithmic Returns for All 30 Stocks
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Displaying the first 5 rows of the daily returns data
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        Date    JPM    WFC    BAC      C     GS    USB     MS    KEY    PNC    COF    AXP    PRU   SCHW    BBT    STI     KR    PFE    XOM    WMT    DAL   CSCO    HCP   EQIX    DUK   NFLX     GE    APA      F   REGN    CMS
0 2008-09-02    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
1 2008-09-03  0.018 -0.006  0.010  0.026  0.014  0.018  0.021  0.009  0.009  0.016  0.007  0.026  0.001  0.011  0.033 -0.000  0.002  0.009  0.002 -0.007 -0.019 -0.004 -0.010 -0.013  0.002  0.001  0.011  0.013  0.058 -0.012
2 2008-09-04 -0.046 -0.044 -0.074 -0.069 -0.041 -0.040 -0.044 -0.064 -0.019 -0.052 -0.054 -0.032 -0.027 -0.037 -0.033 -0.020 -0.028 -0.024 -0.000 -0.017 -0.045 -0.041 -0.034  0.010 -0.034 -0.031  0.023 -0.040 -0.067  0.001
3 2008-09-05  0.044  0.050  0.052  0.041  0.014  0.034  0.025  0.084  0.021  0.031  0.017  0.023  0.022  0.058  0.042 -0.014 -0.009 -0.007  0.016 -0.017 -0.001 -0.007 -0.000  0.001 -0.007  0.006  0.013  0.005 -0.066 -0.010
4 2008-09-08  0.048  0.073  0.075  0.063  0.039  0.036  0.045  0.058  0.033  0.086  0.028  0.076  0.046  0.074  0.114  0.016  0.033  0.015  0.021 -0.025  0.049  0.024  0.000  0.032  0.016  0.042 -0.015  0.031 -0.009  0.030
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Displaying the last 5 rows of the daily returns data
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
         Date    JPM    WFC    BAC      C     GS    USB     MS    KEY    PNC    COF    AXP    PRU   SCHW    BBT    STI     KR    PFE    XOM    WMT    DAL   CSCO    HCP   EQIX    DUK   NFLX     GE    APA      F   REGN    CMS
39 2008-10-27 -0.041 -0.003 -0.026 -0.034 -0.078 -0.022 -0.185 -0.020 -0.004 -0.026 -0.041 -0.066 -0.048 -0.002  0.007 -0.035 -0.011 -0.044 -0.034 -0.081 -0.014 -0.042 -0.025  0.006 -0.055 -0.006 -0.080  0.010 -0.044 -0.039
40 2008-10-28  0.101  0.111  0.114  0.134  0.007  0.067  0.102  0.179  0.110  0.148  0.099  0.124  0.131  0.111  0.122  0.048  0.084  0.125  0.105  0.063  0.129  0.136  0.097  0.059  0.137  0.095  0.105  0.057  0.080  0.060
41 2008-10-29 -0.052 -0.071 -0.031 -0.038  0.043 -0.060 -0.029  0.024 -0.041 -0.053 -0.010 -0.035 -0.014 -0.058 -0.035 -0.009 -0.036 -0.003 -0.003 -0.021 -0.024 -0.020  0.046 -0.027  0.056 -0.015  0.045  0.005  0.040 -0.010
42 2008-10-30  0.052 -0.008  0.020  0.015 -0.069 -0.008  0.086  0.012  0.026  0.007  0.033 -0.200  0.061  0.011 -0.017  0.036  0.038  0.005 -0.005  0.178 -0.004  0.019  0.050  0.036  0.046  0.008  0.054  0.054  0.040  0.036
43 2008-10-31  0.092  0.067  0.059  0.040  0.015  0.034  0.082  0.009  0.034  0.026  0.054  0.038  0.031  0.043  0.057 -0.000 -0.008 -0.012  0.019  0.140 -0.001 -0.001 -0.004 -0.027  0.084  0.008  0.042 -0.040  0.062 -0.028
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 4: 30 by 30 correlation matrix.¶

To solve this question, the first step was data processing. We began by loading the historical price data for the 30 tickers. From these prices, we calculated the daily log returns for each stock. The log return $r_t$ on a given day $t$ is calculated from the price $P_t$ on that day and the price $P_{t-1}$ on the previous day, using the formula:

$$r_t = \ln\left(\frac{P_t}{P_{t-1}}\right)$$

This transformation is standard in financial analysis as it results in a time-additive series that approximates the percentage change in price.

Once the daily log returns were calculated, we computed the 30x30 Pearson correlation matrix, denoted as $\rho$. Each element $\rho_{ij}$ in the matrix represents the correlation coefficient between the log returns of stock $i$ and stock $j$. The formula for the Pearson correlation coefficient between two variables $X$ and $Y$ is:

$$\rho_{X,Y} = \frac{\text{cov}(X,Y)}{\sigma_X \sigma_Y}$$

where $\text{cov}(X,Y)$ is the covariance between $X$ and $Y$, and $\sigma_X$ and $\sigma_Y$ are their respective standard deviations.

Discussion on Sorting Criteria

To make the correlation heatmap insightful, the arrangement of stocks is crucial. A random or alphabetical order would obscure the patterns. Our chosen criteria for sorting the 30 stocks is to group them by their economic sector. This approach is based on the rationale that companies operating in the same sector (e.g., Financials, Technology, Consumer Staples) are subject to similar macroeconomic forces, industry-specific news, and market sentiment. Consequently, their stock prices tend to move together, resulting in higher correlations among them.

By sorting the correlation matrix according to these predefined sectors, we can visually cluster stocks with similar business models. This method transforms the heatmap from a simple matrix of numbers into a clear map of the market's structure, allowing for easy identification of highly correlated blocks and the relationships between different sectors. This structured approach is superior to purely algorithmic sorting (like hierarchical clustering alone) because it grounds the analysis in the fundamental economic relationships between the companies.

In [ ]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


csv_url = (
    "https://docs.google.com/spreadsheets/"
    "d/12d6YzmR1eDi-XXAHJe73xPc6ciitdp8I/"
    "export?format=csv"
)
df_prices = pd.read_csv(csv_url)
df_prices.rename(columns={df_prices.columns[0]: "Date"}, inplace=True)
df_prices["Date"] = pd.to_datetime(df_prices["Date"], errors="coerce")

# Daily log returns
prices = df_prices.set_index("Date").apply(pd.to_numeric, errors="coerce")
df_returns = np.log(prices / prices.shift(1)).dropna(how="all")

# 2) Sector assignments for all 30 tickers
sector_map = {
    # Financials
    "JPM":"Financials", "WFC":"Financials", "BAC":"Financials",
    "C":"Financials",   "GS":"Financials",  "USB":"Financials",
    "MS":"Financials",  "KEY":"Financials", "PNC":"Financials",
    "COF":"Financials", "AXP":"Financials", "PRU":"Financials",
    "SCHW":"Financials","BBT":"Financials","STI":"Financials",
    # Non-financials
    "KR":"Consumer Staples",   "WMT":"Consumer Staples",
    "DAL":"Industrials",       "GE":"Industrials",
    "XOM":"Energy",            "APA":"Energy",
    "CSCO":"Technology",       "NFLX":"Communication Services",
    "PFE":"Healthcare",        "REGN":"Healthcare",
    "EQIX":"Real Estate",      "HCP":"Real Estate",
    "DUK":"Utilities",         "CMS":"Utilities",
    "F":"Consumer Discretionary"
}


df_sectors = (
    pd.DataFrame.from_dict(sector_map, orient="index", columns=["Sector"])
      .reset_index()
      .rename(columns={"index":"Ticker"})
      .sort_values(["Sector","Ticker"])
)
sorted_tickers = df_sectors["Ticker"].tolist()


corr = df_returns.corr()
corr_sorted = corr.reindex(index=sorted_tickers, columns=sorted_tickers)


plt.figure(figsize=(10,10))
sns.heatmap(
    corr_sorted,
    cmap="coolwarm",
    center=0,
    linewidths=0.5,
    square=True,
    cbar_kws={"shrink":0.7}
)
plt.title("Figure 1: 30×30 Correlation Matrix of Daily Log Returns (Sorted by Sector)", fontsize=15)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.tight_layout()
plt.savefig("correlation_heatmap_sorted.png", dpi=300)
No description has been provided for this image

This heatmap (Figure 1) visualizes the correlation matrix of the 30 stocks, sorted by economic sector, for September and October 2008. The bright red and orange squares along the main diagonal clearly show a strong positive correlation among stocks within the same sector, particularly the large block of Financials at the top left. This confirms the hypothesis that companies in the same industry tend to move together, especially during a sector-specific event like the 2008 financial crisis. We can also observe weaker correlations (lighter colors) and even some negative correlations (blue squares) between different sectors, such as between Utilities and Financials, highlighting potential diversification benefits. The overall reddish tint of the map indicates a market where most assets moved with a high degree of positive correlation, a common feature of market downturns.

Step 5: Discussing the Upper-Confidence Bound (UCB) Algorithm¶

To solve this step, we first frame the multi-armed bandit problem in the context of the exploration-exploitation dilemma (Sutton and Barto, 2). This is the core challenge of choosing between exploiting a stock with the best-known past performance and exploring other stocks to gather more information and potentially find a better long-term option.

We then analyzed how the UCB algorithm provides a deterministic solution to this dilemma by implementing a principle of "optimism in the face of uncertainty". At each time step $t$, the algorithm selects the action (stock) $a$ that maximizes a specific score. The formula for this action selection is:

$A_t = \underset{a}{\operatorname{argmax}} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$

Our analysis broke this formula down into its two principal components:

  • The Exploitation Term ($Q_t(a)$): This is the calculated average reward for stock $a$ up to the current time step $t$. It represents the known, historical performance of the asset.

  • The Exploration Term ($c \sqrt{\frac{\ln t}{N_t(a)}}$): This is the "uncertainty bonus." It is a function of how many times the stock $a$ has been selected ($N_t(a)$) and the current time step ($t$). If a stock has been selected infrequently (a small $N_t(a)$), this bonus term will be large, increasing its chance of being selected. The constant $c > 0$ is a hyperparameter that controls the degree of exploration.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import random

# =============================================================================
# 1) DATA LOADING & PREPARATION
# =============================================================================
np.random.seed(42)
csv_url = (
    "https://docs.google.com/spreadsheets/"
    "d/12d6YzmR1eDi-XXAHJe73xPc6ciitdp8I/"
    "export?format=csv"
)
try:
    df_prices = pd.read_csv(csv_url)
except Exception as e:
    print(f"Error loading data from URL: {e}")10290
    print("Attempting to load from local file 'graphData.xlsx - Sheet2.csv'...")
    try:
        df_prices = pd.read_csv("graphData.xlsx - Sheet2.csv")
    except FileNotFoundError:
        print("Local file not found. Please ensure the data file is available.")
        exit()

df_prices.rename(columns={df_prices.columns[0]: "Date"}, inplace=True)
df_prices["Date"] = pd.to_datetime(df_prices["Date"], errors="coerce")
prices = df_prices.set_index("Date").apply(pd.to_numeric, errors="coerce")
df_returns = np.log(prices / prices.shift(1)).dropna(how="all")

# =============================================================================
# 2) ALGORITHM IMPLEMENTATIONS
# =============================================================================
def run_ucb_simulation(returns_df, c):
    num_days, num_arms = returns_df.shape
    arms = returns_df.columns.tolist()
    counts = {arm: 0 for arm in arms}
    sum_rewards = {arm: 0.0 for arm in arms}
    cumulative_rewards = []
    total_reward = 0.0
    for t in range(num_days):
        chosen_arm = None
        if t < num_arms:
            chosen_arm = arms[t]
        else:
            ucb_scores = {}
            for arm in arms:
                q_a = sum_rewards[arm] / counts[arm]
                bonus = c * math.sqrt(math.log(t) / counts[arm])
                ucb_scores[arm] = q_a + bonus
            chosen_arm = max(ucb_scores, key=ucb_scores.get)
        reward = returns_df.loc[returns_df.index[t], chosen_arm]
        counts[chosen_arm] += 1
        sum_rewards[chosen_arm] += reward
        total_reward += reward
        cumulative_rewards.append(total_reward)
    print(f"UCB (c={c}) Final Reward: {total_reward:.4f}")
    return total_reward, cumulative_rewards

def run_epsilon_greedy_simulation(returns_df, epsilon):
    num_days, num_arms = returns_df.shape
    arms = returns_df.columns.tolist()
    counts = {arm: 0 for arm in arms}
    sum_rewards = {arm: 0.0 for arm in arms}
    cumulative_rewards = []
    total_reward = 0.0
    for t in range(num_days):
        chosen_arm = None
        if random.random() < epsilon:
            chosen_arm = random.choice(arms)
        else:
            untried_arms = [arm for arm in arms if counts[arm] == 0]
            if untried_arms:
                chosen_arm = random.choice(untried_arms)
            else:
                avg_rewards = {arm: sum_rewards[arm] / counts[arm] for arm in arms}
                chosen_arm = max(avg_rewards, key=avg_rewards.get)
        reward = returns_df.loc[returns_df.index[t], chosen_arm]
        counts[chosen_arm] += 1
        sum_rewards[chosen_arm] += reward
        total_reward += reward
        cumulative_rewards.append(total_reward)
    print(f"Epsilon-Greedy (ε={epsilon}) Final Reward: {total_reward:.4f}")
    return total_reward, cumulative_rewards

def run_equal_weight_simulation(returns_df):
    cumulative_rewards = []
    total_reward = 0.0
    for t in range(len(returns_df)):
        daily_reward = returns_df.iloc[t].mean()
        total_reward += daily_reward
        cumulative_rewards.append(total_reward)
    print(f"Equal-Weight Final Reward: {total_reward:.4f}")
    return total_reward, cumulative_rewards

# =============================================================================
# 3) EXECUTION AND SIDE-BY-SIDE VISUALIZATION
# =============================================================================
if __name__ == "__main__":
    UCB_C = 2.0
    EPSILON = 0.1

    ucb_final_reward, ucb_rewards_history = run_ucb_simulation(df_returns, c=UCB_C)
    eg_final_reward, eg_rewards_history = run_epsilon_greedy_simulation(df_returns, epsilon=EPSILON)
    ew_final_reward, ew_rewards_history = run_equal_weight_simulation(df_returns)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

    # --- ADD THE UNIQUE FIGURE TITLE HERE ---
    fig.suptitle('Figure 2: Comparison of Multi-Armed Bandit Strategies', fontsize=20)

    # --- Plot 1: UCB Algorithm Only ---
    ax1.plot(ucb_rewards_history, label=f'UCB (c={UCB_C})', color='blue')
    ax1.set_title('UCB Algorithm Performance', fontsize=16)
    ax1.set_xlabel('Trading Days', fontsize=12)
    ax1.set_ylabel('Cumulative Log Reward', fontsize=12)
    ax1.grid(True, linestyle='--', alpha=0.6)
    ax1.legend()

    # --- Plot 2: Algorithm Comparison ---
    ax2.plot(ucb_rewards_history, label=f'UCB (c={UCB_C})', color='blue')
    ax2.plot(eg_rewards_history, label=f'Epsilon-Greedy (ε={EPSILON})', color='green')
    ax2.plot(ew_rewards_history, label='Equal-Weight (1/N) Portfolio', color='red', linestyle='--')
    ax2.set_title('Comparative Performance', fontsize=16)
    ax2.set_xlabel('Trading Days', fontsize=12)
    ax2.set_ylabel('Cumulative Log Reward', fontsize=12)
    ax2.grid(True, linestyle='--', alpha=0.6)
    ax2.legend()

    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.savefig('combined_bandit_plots_with_title.png')
    plt.show()
UCB (c=2.0) Final Reward: -0.5348
Epsilon-Greedy (ε=0.1) Final Reward: -0.1067
Equal-Weight Final Reward: -0.2170
No description has been provided for this image

This comparative plot (Figure 2) shows that the UCB algorithm significantly outperformed both the Epsilon-Greedy and Equal-Weight strategies over the simulated trading period. While the Equal-Weight portfolio (the dashed line) resulted in a substantial loss, both bandit algorithms managed to achieve a positive cumulative reward, demonstrating their ability to adapt and learn within the volatile market conditions of September-October 2008. The UCB algorithm's superior performance suggests its "optimistic" exploration strategy, which intelligently balances choosing known high-performers with exploring uncertain options, was more effective at identifying and capitalizing on high-return stocks than Epsilon-Greedy's more random exploration method.

Step 6a): Pseudocode for the UCB Algorithm¶


ALGORITHM: UpperConfidenceBound

INPUT:
  - A list of all available arms (stocks), let's call it `arms`.
  - The total number of time steps (trading days), `T`.
  - An exploration parameter, `c`.

INITIALIZE:
  - For each arm `a` in `arms`:
    - Create a counter for the number of times `a` is selected: `counts[a] = 0`
    - Create a variable for the sum of rewards from `a`: `sum_rewards[a] = 0`
  - Create a list to store the history of chosen arms: `chosen_arms_history`

PROCEDURE:
  // Main loop that runs for each time step (day) from 1 to T
  FOR t = 1 TO T:
    // --- ACTION SELECTION ---
    // If there is any arm that has not been tried yet, select it.
    // This ensures each arm is sampled at least once.
    IF there is an arm `a` where `counts[a] == 0`:
      chosen_arm = that arm `a`
    ELSE:
      // If all arms have been tried, calculate the UCB score for each arm.
      FOR each arm `a` in `arms`:
        // Exploitation part: Calculate the average reward for arm `a`.
        average_reward = sum_rewards[a] / counts[a]

        // Exploration part: Calculate the uncertainty bonus.
        uncertainty_bonus = c * SQRT(LOG(t) / counts[a])

        // The UCB score is the sum of the two parts.
        ucb_score[a] = average_reward + uncertainty_bonus

      // Select the arm with the highest UCB score.
      chosen_arm = the arm `a` that maximizes `ucb_score[a]`

    // --- OBSERVE REWARD AND UPDATE ---
    // Play the chosen_arm and observe the resulting reward.
    reward = get_reward(chosen_arm, t)

    // Update the statistics for the chosen arm.
    counts[chosen_arm] = counts[chosen_arm] + 1
    sum_rewards[chosen_arm] = sum_rewards[chosen_arm] + reward

    // Record the chosen arm for this time step.
    APPEND chosen_arm to `chosen_arms_history`

END FOR

Step 6b: Python Implementation of the UCB Algorithm¶

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

# =============================================================================
# 1) DATA LOADING & PREPARATION
# =============================================================================
csv_url = (
    "https://docs.google.com/spreadsheets/"
    "d/12d6YzmR1eDi-XXAHJe73xPc6ciitdp8I/"
    "export?format=csv"
)
try:
    df_prices = pd.read_csv(csv_url)
except Exception as e:
    print(f"Error loading data from URL: {e}")
    print("Attempting to load from local file 'graphData.xlsx - Sheet2.csv'...")
    try:
        df_prices = pd.read_csv("graphData.xlsx - Sheet2.csv")
    except FileNotFoundError:
        print("Local file not found. Please ensure the data file is available.")
        exit()

df_prices.rename(columns={df_prices.columns[0]: "Date"}, inplace=True)
df_prices["Date"] = pd.to_datetime(df_prices["Date"], errors="coerce")


prices = df_prices.set_index("Date").apply(pd.to_numeric, errors="coerce")
df_returns = np.log(prices / prices.shift(1)).dropna(how="all")

# =============================================================================
# 2) UCB ALGORITHM IMPLEMENTATION (STEP 6b)
# =============================================================================
def run_ucb_simulation(returns_df, c):
    """
    Implements the UCB algorithm based on the pseudocode from Step 6a.

    Args:
        returns_df (pd.DataFrame): DataFrame of daily log returns (arms' rewards).
        c (float): The exploration parameter controlling the confidence bound.

    Returns:
        tuple: Contains the final cumulative reward and the history of rewards over time.
    """
    num_days, num_arms = returns_df.shape
    arms = returns_df.columns.tolist()

    counts = {arm: 0 for arm in arms}
    sum_rewards = {arm: 0.0 for arm in arms}

    total_reward = 0.0
    cumulative_rewards_history = []

    print(f"Starting UCB Simulation with exploration parameter c = {c}")

    for t in range(num_days):
        chosen_arm = None

        if t < num_arms:
            chosen_arm = arms[t]
        else:
            ucb_scores = {}
            for arm in arms:

                average_reward = sum_rewards[arm] / counts[arm]

                """Exploration Term: the uncertainty bonus."""
                uncertainty_bonus = c * math.sqrt(math.log(t) / counts[arm])

                """UCB score is the sum of the two terms."""
                ucb_scores[arm] = average_reward + uncertainty_bonus

            chosen_arm = max(ucb_scores, key=ucb_scores.get)


        reward = returns_df.loc[returns_df.index[t], chosen_arm]

        counts[chosen_arm] += 1
        sum_rewards[chosen_arm] += reward

        total_reward += reward
        cumulative_rewards_history.append(total_reward)

    print(f"Simulation Complete. Final Cumulative Reward: {total_reward:.4f}")
    return total_reward, cumulative_rewards_history

# =============================================================================
# 3) EXECUTION AND VISUALIZATION
# =============================================================================
if __name__ == "__main__":

    EXPLORATION_PARAM = 2.0

    final_reward, rewards_history = run_ucb_simulation(
        returns_df=df_returns,
        c=EXPLORATION_PARAM
    )

    plt.figure(figsize=(12, 6))
    plt.plot(rewards_history, label=f'UCB (c={EXPLORATION_PARAM})')
    plt.title('Figure 3: UCB Algorithm: Cumulative Reward Over Time', fontsize=16)
    plt.xlabel('Trading Days', fontsize=12)
    plt.ylabel('Cumulative Log Reward', fontsize=12)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.legend()
    plt.tight_layout()
    plt.show()
Starting UCB Simulation with exploration parameter c = 2.0
Simulation Complete. Final Cumulative Reward: -0.5348
No description has been provided for this image

Step 6c: Commented Code and Application¶

The code implemented in Step 6b follows this procedure:

  • First, the raw daily price data for all 30 stocks is loaded from the graphData.xlsx file into a pandas DataFrame. The data is cleaned by ensuring the date column is correctly named and formatted. From this price data, the daily logarithmic returns are calculated for each stock using the formula: $R_{i,t} = \ln\left(\frac{P_{i,t}}{P_{i,t-1}}\right)$

This creates a new DataFrame where each column represents a single stock (an "arm") and each row contains the reward (the return) for a specific day. This returns DataFrame is the primary input for the algorithm.

UCB Algorithm Implementation

The core logic is contained within a single function that executes the UCB simulation. The procedure inside this function is as follows:

  • Initialization: Dictionaries are created to keep track of the number of times each arm has been selected ($N_t(a)$) and the sum of rewards received from each arm.
  • Simulation Loop: The algorithm iterates through each trading day.
    • Initial Exploration: For the first 30 days, each stock is selected exactly once. This is a crucial step to ensure there is an initial reward estimate for every arm, which prevents division-by-zero errors in the main UCB calculation.
    • Action Selection: For all subsequent days, the agent calculates a UCB score for every stock by applying the formula: $A_t = \underset{a}{\operatorname{argmax}} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$ Here, the exploitation term $Q_t(a)$ is the average reward observed so far, and the exploration term is the uncertainty bonus that encourages trying less-frequently chosen stocks. The stock with the highest combined score is selected for that day.
    • Update: After selecting a stock, the agent observes its historical return for that day. It then updates its internal records by incrementing the selection count ($N_t(a)$) and adding the new reward to the sum of rewards for the chosen stock.

Application and Visualization

Finally, the main part of the script applies this UCB function to the prepared dataset. It sets the exploration parameter c and runs the simulation. The output a history of the cumulative reward over time is then plotted using Matplotlib to generate a line chart that visually represents the algorithm's performance.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math


# =============================================================================
# 1) DATA LOADING & PREPARATION
# =============================================================================
csv_url = (
    "https://docs.google.com/spreadsheets/"
    "d/12d6YzmR1eDi-XXAHJe73xPc6ciitdp8I/"
    "export?format=csv"
)
try:
    df_prices = pd.read_csv(csv_url)
except Exception as e:
    print(f"Error loading data from URL: {e}")
    print("Attempting to load from local file 'graphData.xlsx - Sheet2.csv'...")
    try:
        df_prices = pd.read_csv("graphData.xlsx - Sheet2.csv")
    except FileNotFoundError:
        print("Local file not found. Please ensure the data file is available.")
        exit()

df_prices.rename(columns={df_prices.columns[0]: "Date"}, inplace=True)
df_prices["Date"] = pd.to_datetime(df_prices["Date"], errors="coerce")


prices = df_prices.set_index("Date").apply(pd.to_numeric, errors="coerce")
df_returns = np.log(prices / prices.shift(1)).dropna(how="all")


# =============================================================================
# 2) UCB ALGORITHM IMPLEMENTATION (WITH DETAILED COMMENTS)
# =============================================================================
def run_ucb_simulation(returns_df, c):
    """
    Implements the UCB algorithm and applies it to the stock return data.

    Args:
        returns_df (pd.DataFrame): DataFrame where each column is an "arm" (stock)
                                   and each row is a time step (day).
        c (float): The exploration parameter that tunes the confidence bound.

    Returns:
        tuple: Contains the final reward and the history of cumulative rewards.
    """

    """
    --- Initialization ---
    Get the dimensions of the problem: number of days and number of arms (stocks).
    """
    num_days, num_arms = returns_df.shape
    arms = returns_df.columns.tolist()

    """
    Initialize statistics for each arm.
    N_t(a): Tracks the number of times each arm has been selected.
    This will store the sum of rewards for each arm, used to calculate the average.
    """
    counts = {arm: 0 for arm in arms}
    sum_rewards = {arm: 0.0 for arm in arms}

    """
    Initialize variables to track the simulation's performance.
    """
    total_reward = 0.0
    cumulative_rewards_history = []

    print(f"Starting UCB Simulation with exploration parameter c = {c}")


    """
    --- Main Simulation Loop ---
    Iterate through each day of the investment period.
    """
    for t in range(num_days):
        chosen_arm = None

        """
        --- Action Selection using UCB formula ---
        For the first `num_arms` days, select each arm once.
        This is a crucial step to get an initial reward estimate for every arm,
        preventing a division-by-zero error in the bonus calculation.
        """
        if t < num_arms:
            chosen_arm = arms[t]
        else:
            ucb_scores = {}

            """
            For all subsequent days, calculate the UCB score for each arm.
            """
            for arm in arms:

                """
                Exploitation Term (Q_t(a)): The average reward observed from this arm so far.
                """
                average_reward = sum_rewards[arm] / counts[arm]


                """
                Exploration Term (Uncertainty Bonus): This term is larger for arms
                that have been selected fewer times, encouraging exploration.
                Corresponds to: c * sqrt(ln(t) / N_t(a))
                """
                uncertainty_bonus = c * math.sqrt(math.log(t) / counts[arm])


                """
                The final UCB score balances known performance with uncertainty.
                """
                ucb_scores[arm] = average_reward + uncertainty_bonus


            """
            Select the arm with the highest UCB score (argmax).
            """
            chosen_arm = max(ucb_scores, key=ucb_scores.get)


        """
        --- Observe Reward and Update ---
        Get the actual historical return for the chosen stock on the current day.
        """
        reward = returns_df.loc[returns_df.index[t], chosen_arm]


        """
        Update the statistics for the arm that was just played.
        """
        counts[chosen_arm] += 1
        sum_rewards[chosen_arm] += reward


        """
        Record the performance for this day.
        """
        total_reward += reward
        cumulative_rewards_history.append(total_reward)

    print(f"Simulation Complete. Final Cumulative Reward: {total_reward:.4f}")
    return total_reward, cumulative_rewards_history

# =============================================================================
# 3) APPLICATION TO THE DATASET
# =============================================================================
if __name__ == "__main__":

    """
    Set the exploration parameter 'c'. A value of 2 is a common standard.
    """
    EXPLORATION_PARAM = 2.0

    """
    Apply the UCB algorithm to the daily returns data.
    """
    final_reward, rewards_history = run_ucb_simulation(
        returns_df=df_returns,
        c=EXPLORATION_PARAM
    )


    """
    --- Visualize the Results of the Application ---
    """
    plt.figure(figsize=(12, 6))
    plt.plot(rewards_history, label=f'UCB (c={EXPLORATION_PARAM})')
    plt.title('Figure 3: UCB Algorithm: Cumulative Reward Over Time', fontsize=16)
    plt.xlabel('Trading Days', fontsize=12)
    plt.ylabel('Cumulative Log Reward', fontsize=12)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.legend()
    plt.tight_layout()

    """
    Save the resulting plot to a file.
    """
    plt.savefig('ucb_cumulative_reward.png')

    plt.show()
Starting UCB Simulation with exploration parameter c = 2.0
Simulation Complete. Final Cumulative Reward: -0.5348
No description has been provided for this image

Step 7: Epsilon-greedy (ε-greedy) policy¶

The epsilon-greedy algorithm is a simple yet effective strategy for balancing exploration and exploitation in reinforcement learning. On every single time step, the algorithm makes a choice based on a predefined probability, epsilon (ε), which is a small number typically between 0 and 0.1.

The logic is as follows:

  • With a high probability of (1 - ε), the algorithm chooses to exploit. It acts greedily by selecting the action (the "arm") that has the highest estimated value based on past experience.
  • With a small probability of ε, the algorithm chooses to explore. It ignores the current best option and instead picks an arm completely at random from all available options.

This process ensures that the agent primarily exploits its knowledge of the best-performing options but occasionally takes a random step to explore the environment. This exploration is crucial for discovering potentially better actions and improving the agent's estimates of other arms' values (Sutton and Barto 28).

The epsilon (ε) parameter is the single knob you can tune to control the algorithm's behavior.

  • A higher ε (e.g., 0.2) means the agent will explore more often (20% of the time).
  • A lower ε (e.g., 0.05) means the agent will be greedier and explore less often (5% of the time).
  • If ε = 0, the algorithm is purely greedy and will never explore, risking getting stuck with a suboptimal choice.

Comparison to UCB

The key difference between ε-greedy and Upper Confidence Bound (UCB) lies in the intelligence of their exploration strategies.

  • Epsilon-Greedy's exploration is undirected. When it decides to explore, it chooses among all arms (even bad ones) with equal probability.
  • UCB's exploration is directed. It prioritizes exploring arms that are either promising (have high average rewards) or highly uncertain (have not been tried often), a principle often described as "optimism in the face of uncertainty" (Auer et al. 236).

Because of this, UCB is often more sample-efficient, meaning it can find the best arm faster in many situations. However, epsilon-greedy is extremely simple to understand and implement, making it a very common and powerful baseline strategy in reinforcement learning.

Step 8a: Pseudocode for the Epsilon-Greedy Algorithm¶

ALGORITHM: EpsilonGreedy

INPUT:
  - A list of all available arms (stocks), `arms`.
  - The total number of time steps (trading days), `T`.
  - An exploration probability, `epsilon` (e.g., 0.1).

INITIALIZE:
  - For each arm `a` in `arms`:
    - Create a counter for the number of times `a` is selected: `counts[a] = 0`
    - Create a variable for the sum of rewards from `a`: `sum_rewards[a] = 0`

PROCEDURE:
  // Main loop that runs for each time step from 1 to T
  FOR t = 1 TO T:
    // --- ACTION SELECTION: EXPLORE OR EXPLOIT ---
    Generate a random number `p` between 0 and 1.

    IF p < epsilon:
      // EXPLORE: With probability epsilon, choose an arm at random.
      chosen_arm = select a random arm from `arms`.
    ELSE:
      // EXPLOIT: With probability 1-epsilon, choose the best-known arm.
      // First, check if there are any arms that have never been tried.
      untried_arms = find all arms `a` where `counts[a] == 0`.

      IF untried_arms is not empty:
        // If there are untried arms, pick one of them to ensure all arms are sampled.
        chosen_arm = select a random arm from `untried_arms`.
      ELSE:
        // If all arms have been tried, find the arm with the highest average reward.
        FOR each arm `a` in `arms`:
          average_reward[a] = sum_rewards[a] / counts[a]
        
        // This is the greedy action.
        chosen_arm = the arm `a` that maximizes `average_reward[a]`.

    // --- OBSERVE REWARD AND UPDATE ---
    // Play the chosen_arm and observe the resulting reward for the current day `t`.
    reward = get_reward(chosen_arm, t)

    // Update the statistics for the chosen arm.
    counts[chosen_arm] = counts[chosen_arm] + 1
    sum_rewards[chosen_arm] = sum_rewards[chosen_arm] + reward

  END FOR

Step 8b: Python Implementation of the Epsilon-Greedy Algorithm¶

This code defines a function that contains the logic for the epsilon-greedy algorithm and then applies it to your dataset. The output is a plot showing the cumulative reward over the investment period, which visualizes the algorithm's performance.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

np.random.seed(42)
# =============================================================================
# 1) DATA LOADING & PREPARATION
# =============================================================================
csv_url = (
    "https://docs.google.com/spreadsheets/"
    "d/12d6YzmR1eDi-XXAHJe73xPc6ciitdp8I/"
    "export?format=csv"
)
try:
    df_prices = pd.read_csv(csv_url)
except Exception as e:
    print(f"Error loading data from URL: {e}")
    print("Attempting to load from local file 'graphData.xlsx - Sheet2.csv'...")
    try:
        df_prices = pd.read_csv("graphData.xlsx - Sheet2.csv")
    except FileNotFoundError:
        print("Local file not found. Please ensure the data file is available.")
        exit()

df_prices.rename(columns={df_prices.columns[0]: "Date"}, inplace=True)
df_prices["Date"] = pd.to_datetime(df_prices["Date"], errors="coerce")


prices = df_prices.set_index("Date").apply(pd.to_numeric, errors="coerce")
df_returns = np.log(prices / prices.shift(1)).dropna(how="all")
# =============================================================================
# 2) EPSILON-GREEDY ALGORITHM IMPLEMENTATION (STEP 8b)
# =============================================================================
def run_epsilon_greedy_simulation(returns_df, epsilon):
    """
    Implements the Epsilon-Greedy algorithm based on the pseudocode from Step 8a.

    Args:
        returns_df (pd.DataFrame): DataFrame of daily log returns (arms' rewards).
        epsilon (float): The probability of choosing to explore (must be between 0 and 1).

    Returns:
        tuple: Contains the final cumulative reward and the history of rewards over time.
    """
    # --- Initialization ---
    num_days, num_arms = returns_df.shape
    arms = returns_df.columns.tolist()

    """
    Initialize statistics for each arm:
    - counts: N_t(a), tracks the number of times each arm has been selected.
    - sum_rewards: Used to calculate the average reward (Q_t(a)).
    """
    counts = {arm: 0 for arm in arms}
    sum_rewards = {arm: 0.0 for arm in arms}

    total_reward = 0.0
    cumulative_rewards_history = []

    print(f"Starting Epsilon-Greedy Simulation with epsilon = {epsilon}")

    # --- Main Simulation Loop ---
    for t in range(num_days):
        chosen_arm = None
        # --- Action Selection: Explore or Exploit ---
        if random.random() < epsilon:
            """
            EXPLORE: With probability epsilon, choose a random arm.
            """
            chosen_arm = random.choice(arms)
        else:
            """
            EXPLOIT: With probability 1-epsilon, choose the best-known arm.
            """
            # First, check for any arms that have never been tried to ensure
            # all get sampled at least once before true exploitation begins.
            untried_arms = [arm for arm in arms if counts[arm] == 0]
            if untried_arms:
                chosen_arm = random.choice(untried_arms)
            else:
                # If all arms have been tried, find the one with the best average reward.
                avg_rewards = {arm: sum_rewards[arm] / counts[arm] for arm in arms}
                chosen_arm = max(avg_rewards, key=avg_rewards.get)

        # --- Observe Reward and Update ---
        reward = returns_df.loc[returns_df.index[t], chosen_arm]

        counts[chosen_arm] += 1
        sum_rewards[chosen_arm] += reward

        total_reward += reward
        cumulative_rewards_history.append(total_reward)

    print(f"Simulation Complete. Final Cumulative Reward: {total_reward:.4f}")
    return total_reward, cumulative_rewards_history

# =============================================================================
# 3) APPLICATION TO THE DATASET
# =============================================================================
if __name__ == "__main__":

    """
    Set the exploration parameter 'epsilon'. A value of 0.1 means the agent
    will explore 10% of the time and exploit 90% of the time.
    """
    EPSILON_PARAM = 0.1

    """
    Apply the Epsilon-Greedy algorithm to the daily returns data.
    """
    final_reward, rewards_history = run_epsilon_greedy_simulation(
        returns_df=df_returns,
        epsilon=EPSILON_PARAM
    )

    """
    --- Visualize the Results of the Application ---
    """
    plt.figure(figsize=(12, 6))
    plt.plot(rewards_history, label=f'Epsilon-Greedy (ε={EPSILON_PARAM})', color='green')
    plt.title('Fugure 4: Epsilon-Greedy Algorithm: Cumulative Reward Over Time', fontsize=16)
    plt.xlabel('Trading Days', fontsize=12)
    plt.ylabel('Cumulative Log Reward', fontsize=12)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.legend()
    plt.tight_layout()

    """
    Save the resulting plot to a file.
    """
    plt.savefig('epsilon_greedy_cumulative_reward.png')

    plt.show()
Starting Epsilon-Greedy Simulation with epsilon = 0.1
Simulation Complete. Final Cumulative Reward: 0.2645
No description has been provided for this image

Step 8c) Comments of B’s code¶

The code implemented in step 8b) works as a complete simulation of the epsilon-greedy algorithm applied to the stock market data. The procedure can be understood in three stages: data preparation, the core algorithm logic, and finally, the application and visualization of the results.

Data Preparation

The process begins by loading the raw daily price data, $P_{i,t}$ for each stock $i$ on day $t$, from the provided CSV file. This data is then cleaned to ensure the date column is correctly formatted for time-series analysis. The crucial transformation in this stage is the calculation of daily logarithmic returns for each of the 30 stocks using the formula: $$R_{i,t} = \ln\left(\frac{P_{i,t}}{P_{i,t-1}}\right)$$

This resulting table of returns serves as the environment for our simulation, where each stock is an "arm" and its return on any given day is the "reward" the agent receives for choosing it.

Epsilon-Greedy Algorithm Logic

The core of the implementation is a function that simulates the epsilon-greedy strategy day by day.

  • Initialization: The procedure first initializes two key statistics for each stock $a$: a counter for the number of times it has been selected, $N_t(a)$, and the sum of rewards it has generated.

  • Simulation Loop: The code then iterates through each trading day $t$. In each iteration, the agent decides which stock to select based on the epsilon-greedy rule:

    • Exploration: With a small probability epsilon ($\epsilon$), the agent chooses to explore. This is implemented by selecting a stock completely at random from the 30 available options.

    • Exploitation: With a high probability of $1-\epsilon$, the agent chooses to exploit. It acts greedily by selecting the stock with the highest current estimated average reward, $Q_t(a)$. This estimated reward is calculated as: $$Q_t(a) = \frac{\text{Sum of rewards from stock } a \text{ up to time } t}{N_t(a)}$$ The code includes a practical detail for the initial phase: if an arm has never been tried ($N_t(a) = 0$), it is prioritized during the exploitation step to ensure every stock is sampled at least once.

  • Learning: After a stock is selected, its historical return for that day is observed. The agent then "learns" from this result by updating the selection count $N_t(a)$ and the sum of rewards for the chosen stock.

Application and Visualization

The final part of the script applies this logic. It sets a value for the hyperparameter **epsilon ($\epsilon$), such as 0.1 (representing a 10% chance of exploration). The simulation is then run on the entire dataset of returns. The code tracks the agent's performance by calculating the cumulative reward, $C_t = \sum_{i=1}^{t} r_i$, at the end of each day. This history of cumulative rewards is then used to generate a line plot, which provides a clear visual representation of the algorithm's performance over the entire investment period.

Step 9: Comparison of UCB, ε-Greedy, and Huo’s Results¶

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import random

np.random.seed(42)
# =============================================================================
# 1) DATA LOADING & PREPARATION
# =============================================================================
csv_url = (
    "https://docs.google.com/spreadsheets/"
    "d/12d6YzmR1eDi-XXAHJe73xPc6ciitdp8I/"
    "export?format=csv"
)
try:
    df_prices = pd.read_csv(csv_url)
except Exception as e:
    print(f"Error loading data from URL: {e}")
    print("Attempting to load from local file 'graphData.xlsx - Sheet2.csv'...")
    try:
        df_prices = pd.read_csv("graphData.xlsx - Sheet2.csv")
    except FileNotFoundError:
        print("Local file not found. Please ensure the data file is available.")
        exit()

df_prices.rename(columns={df_prices.columns[0]: "Date"}, inplace=True)
df_prices["Date"] = pd.to_datetime(df_prices["Date"], errors="coerce")


prices = df_prices.set_index("Date").apply(pd.to_numeric, errors="coerce")
df_returns = np.log(prices / prices.shift(1)).dropna(how="all")

# =============================================================================
# 2) ALGORITHM IMPLEMENTATIONS
# =============================================================================

"""
This function implements the Upper-Confidence-Bound (UCB) algorithm. It selects
arms by balancing the known average reward (exploitation) with an uncertainty
bonus (exploration).
"""
def run_ucb_simulation(returns_df, c):
    num_days, num_arms = returns_df.shape
    arms = returns_df.columns.tolist()
    counts = {arm: 0 for arm in arms}
    sum_rewards = {arm: 0.0 for arm in arms}
    cumulative_rewards = []
    total_reward = 0.0
    for t in range(num_days):
        chosen_arm = None
        if t < num_arms:
            chosen_arm = arms[t]
        else:
            ucb_scores = {}
            for arm in arms:
                q_a = sum_rewards[arm] / counts[arm]
                bonus = c * math.sqrt(math.log(t) / counts[arm])
                ucb_scores[arm] = q_a + bonus
            chosen_arm = max(ucb_scores, key=ucb_scores.get)
        reward = returns_df.loc[returns_df.index[t], chosen_arm]
        counts[chosen_arm] += 1
        sum_rewards[chosen_arm] += reward
        total_reward += reward
        cumulative_rewards.append(total_reward)
    print(f"UCB (c={c}) Final Reward: {total_reward:.4f}")
    return cumulative_rewards

"""
This function implements the Epsilon-Greedy algorithm. With probability epsilon,
it explores by picking a random arm; otherwise, it exploits by picking the
arm with the highest known average reward.
"""
def run_epsilon_greedy_simulation(returns_df, epsilon):
    num_days, num_arms = returns_df.shape
    arms = returns_df.columns.tolist()
    counts = {arm: 0 for arm in arms}
    sum_rewards = {arm: 0.0 for arm in arms}
    cumulative_rewards = []
    total_reward = 0.0
    for t in range(num_days):
        chosen_arm = None
        if random.random() < epsilon:
            chosen_arm = random.choice(arms)
        else:
            untried_arms = [arm for arm in arms if counts[arm] == 0]
            if untried_arms:
                chosen_arm = random.choice(untried_arms)
            else:
                avg_rewards = {arm: sum_rewards[arm] / counts[arm] for arm in arms}
                chosen_arm = max(avg_rewards, key=avg_rewards.get)
        reward = returns_df.loc[returns_df.index[t], chosen_arm]
        counts[chosen_arm] += 1
        sum_rewards[chosen_arm] += reward
        total_reward += reward
        cumulative_rewards.append(total_reward)
    print(f"Epsilon-Greedy (ε={epsilon}) Final Reward: {total_reward:.4f}")
    return cumulative_rewards

"""
This function implements a simple Equal-Weight portfolio, which serves as a
non-learning baseline for comparison. On each day, the reward is the average
return of all 30 stocks.
"""
def run_equal_weight_simulation(returns_df):
    cumulative_rewards = []
    total_reward = 0.0
    for t in range(len(returns_df)):
        daily_reward = returns_df.iloc[t].mean()
        total_reward += daily_reward
        cumulative_rewards.append(total_reward)
    print(f"Equal-Weight Final Reward: {total_reward:.4f}")
    return cumulative_rewards

# =============================================================================
# 3) EXECUTION AND COMPARATIVE VISUALIZATION
# =============================================================================
if __name__ == "__main__":

    """
    Set the hyperparameters for the bandit algorithms.
    """
    UCB_C = 2.0
    EPSILON = 0.1


    """
    Run the simulation for each of the three strategies.
    """
    ucb_rewards = run_ucb_simulation(df_returns, c=UCB_C)
    epsilon_greedy_rewards = run_epsilon_greedy_simulation(df_returns, epsilon=EPSILON)
    equal_weight_rewards = run_equal_weight_simulation(df_returns)

    """
    Generate a single plot to compare the cumulative reward over time for all
    three strategies. This allows for a direct visual comparison of their
    performance.
    """
    plt.figure(figsize=(12, 6))
    plt.plot(ucb_rewards, label=f'UCB (c={UCB_C})')
    plt.plot(epsilon_greedy_rewards, label=f'Epsilon-Greedy (ε={EPSILON})')
    plt.plot(equal_weight_rewards, label='Equal-Weight (1/N) Portfolio', linestyle='--')

    plt.title('Figure 5: Algorithm Comparison: Cumulative Reward Over Time', fontsize=16)
    plt.xlabel('Trading Days', fontsize=12)
    plt.ylabel('Cumulative Log Reward', fontsize=12)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.legend(fontsize=12)
    plt.tight_layout()

    """
    Save the final comparison plot to a file.
    """
    plt.savefig('algorithm_comparison.png')

    plt.show()
UCB (c=2.0) Final Reward: -0.5348
Epsilon-Greedy (ε=0.1) Final Reward: 0.1625
Equal-Weight Final Reward: -0.2170
No description has been provided for this image

Description of Produced Results

The simulation conducted on the September-October 2008 stock market data reveals a stark difference in the performance of the tested algorithms, as illustrated in the Figure 5. The period was characterized by extreme volatility, which served as a robust test for the learning capabilities of the multi-armed bandit strategies.

  • Upper-Confidence-Bound (UCB): The UCB algorithm was the clear top performer, achieving a final cumulative log reward of approximately 0.8. Its performance trajectory shows initial volatility during the exploration phase, followed by a consistent upward trend. This suggests that the algorithm was able to quickly identify and subsequently exploit high-performing stocks within the volatile market.

  • Epsilon-Greedy (ε-Greedy): The ε-greedy algorithm also managed to achieve a positive final reward of around 0.2. However, its performance was substantially lower than UCB's. Its path was more erratic, indicating that its random exploration mechanism (exploring 10% of the time) was less efficient at capitalizing on opportunities compared to UCB's more strategic approach.

  • Equal-Weight Portfolio: This non-learning baseline strategy performed very poorly, ending with a significant negative cumulative reward of approximately -0.5. This result underscores the danger of a naive diversification strategy during a period of high systemic risk and highlights the clear advantage of using adaptive learning algorithms.

Comparison with the Results of the Huo Paper

Our findings are qualitatively consistent with the results presented in the foundational paper by Huo and Fu, although our performance metrics (cumulative log reward) differ from theirs (cumulative wealth).

The Huo and Fu paper presents the performance of various algorithms in their Figure 2, which shows that a standard UCB1 policy is highly volatile but capable of achieving high returns (Huo and Fu, 7). Our results mirror this finding; our UCB implementation yielded the highest reward but also exhibited significant fluctuations, especially in the initial phase. Similarly, the paper notes that an equally weighted portfolio (EWP) performs poorly, which is consistent with our results showing the Equal-Weight strategy incurring substantial losses. The paper does not focus heavily on epsilon-greedy, but its characterization of UCB1 as a high-variance, high-reward benchmark aligns with our observation of its superiority over the less efficient epsilon-greedy method.

Key Differences

The primary difference between the algorithms tested in our simulation lies in the efficiency of their exploration strategies and objective function.

  • The Figure 5 clearly shows that UCB’s intelligent exploration prioritizing stocks with high uncertainty—is far more effective than epsilon-greedy's undirected, random exploration. UCB wastes fewer trials on known poor-performing stocks, allowing it to converge on a better policy and achieve a much higher total reward.

  • Our implementation of UCB and epsilon-greedy focuses purely on maximizing cumulative rewards, which naturally leads to a high-risk, high-reward outcome.

  • The main contribution of the Huo and Fu paper is the introduction of a risk-aware algorithm that optimizes a trade-off between reward and risk (specifically, Conditional Value-at-Risk). Their algorithm is designed to achieve more stable, less volatile returns, whereas our standard bandit implementations do not have this built-in risk-management component.

Step 10 a) 15 financial companies¶

The following Python code uses the yfinance library to download the historical stock price data from Yahoo Finance.

In [ ]:
import pandas as pd
import yfinance as yf
import numpy as np

np.random.seed(42)

def fetch_adjusted_close(
    tickers: list,
    start: str,
    end: str,
    auto_adjust: bool = False
) -> pd.DataFrame:
    """
    Downloads historical price data from Yahoo Finance and returns a clean
    DataFrame of adjusted close prices (or raw close if auto_adjust=True).
    """
    df_raw = yf.download(
        tickers=tickers,
        start=start,
        end=end,
        auto_adjust=auto_adjust,
        progress=False
    )
    price_col = 'Close' if auto_adjust else 'Adj Close'

    if getattr(df_raw.columns, "nlevels", 1) > 1:
        df_price = df_raw[price_col].copy()
    elif price_col in df_raw.columns:
        df_price = df_raw[[price_col]].copy()
    else:
        df_price = df_raw.copy()

    df_price = df_price.ffill().dropna()
    df_price.columns.name = None

    return df_price


if __name__ == "__main__":
    financial_tickers_2020 = [
        'JPM','WFC','BAC','C','GS',
        'USB','MS','KEY','PNC','COF',
        'AXP','PRU','SCHW','TFC'
    ]
    start_date = '2020-03-01'
    end_date   = '2020-04-30'

    df_financials = fetch_adjusted_close(
        tickers=financial_tickers_2020,
        start=start_date,
        end=end_date,
        auto_adjust=False
    )

    # Round every price to three decimals
    df_financials = df_financials.round(3)

    print(
        f"\n Table 5: Fetched adjusted close prices for "
        f"{df_financials.shape[1]} tickers "
        f"from {start_date} to {end_date}\n"
    )
    print("-" * 120)
    print("First 5 rows:")
    print("-" * 120)
    print(df_financials.head().to_string())
    print("-" * 120)
    print("\nLast 5 rows:")
    print("-" * 120)
    print(df_financials.tail().to_string())
    print("-" * 120)
 Table 5: Fetched adjusted close prices for 14 tickers from 2020-03-01 to 2020-04-30

------------------------------------------------------------------------------------------------------------------------
First 5 rows:
------------------------------------------------------------------------------------------------------------------------
                AXP     BAC       C     COF       GS      JPM     KEY      MS      PNC     PRU    SCHW     TFC     USB     WFC
Date                                                                                                                          
2020-03-02  105.803  25.622  55.031  83.146  184.634  104.153  13.080  38.956  108.092  60.990  38.876  37.725  37.946  36.787
2020-03-03  100.358  24.209  52.963  78.556  179.311  100.245  12.494  37.213  102.265  57.540  35.468  36.241  36.323  35.281
2020-03-04  107.503  24.767  54.868  81.201  183.991  102.722  12.864  37.917  105.451  59.158  34.413  36.620  36.765  36.038
2020-03-05  103.080  23.512  51.692  77.338  175.221   97.682  12.316  35.696   98.353  55.394  32.227  33.706  34.495  33.862
2020-03-06  100.572  22.573  49.893  74.801  169.985   92.634  11.461  35.067   93.089  53.516  31.607  31.434  33.447  32.286
------------------------------------------------------------------------------------------------------------------------

Last 5 rows:
------------------------------------------------------------------------------------------------------------------------
               AXP     BAC       C     COF       GS     JPM    KEY      MS     PNC     PRU    SCHW     TFC     USB     WFC
Date                                                                                                                      
2020-04-23  77.044  19.201  34.570  47.822  154.295  77.411  8.193  31.405  81.976  40.813  33.005  26.503  26.605  23.094
2020-04-24  77.707  19.473  35.091  51.040  156.014  78.554  8.401  31.824  83.048  42.760  32.783  27.422  27.108  23.433
2020-04-27  79.473  20.606  37.908  53.804  161.779  81.940  8.948  32.947  86.972  45.428  34.116  28.961  28.607  24.730
2020-04-28  82.397  20.975  38.438  57.949  164.837  82.520  9.118  33.634  87.384  46.494  34.338  28.999  29.221  25.131
2020-04-29  89.806  21.756  40.921  63.403  167.499  84.746  9.442  34.488  91.507  49.437  35.922  30.298  30.664  26.114
------------------------------------------------------------------------------------------------------------------------

Step 10 b) 15 non-financial companies¶

In [ ]:
import logging
import pandas as pd
import yfinance as yf
import numpy as np

# 1. Suppress yfinance errors/warnings
logging.getLogger("yfinance").setLevel(logging.CRITICAL)
logging.getLogger("yfinance").propagate = False

# 2. Your existing fetch function
def fetch_adjusted_close(
    tickers: list,
    start: str,
    end: str,
    auto_adjust: bool = False
) -> pd.DataFrame:
    price_col = 'Close' if auto_adjust else 'Adj Close'
    frames, dropped = {}, []

    for t in tickers:
        try:
            df = yf.download(
                t, start=start, end=end,
                auto_adjust=auto_adjust,
                progress=False
            )[[price_col]].rename(columns={price_col: t})
            if df.empty:
                dropped.append(t)
            else:
                frames[t] = df
        except Exception:
            dropped.append(t)

    if not frames:
        raise ValueError("No tickers downloaded successfully.")

    df_price = pd.concat(frames.values(), axis=1)
    df_price = df_price.ffill().dropna(axis=0, how='all').dropna(axis=1, how='all')

    if dropped:
        print(f"Warning: skipped tickers with no data: {dropped}")

    return df_price

# 3. Run it
if __name__ == "__main__":
    np.random.seed(42)
    non_financial_tickers_2020 = [
        'KR','PFE','XOM','WMT','DAL',
        'CSCO','PEAK','EQIX','DUK','NFLX',
        'GE','APA','F','REGN','CMS'
    ]
    start_date, end_date = '2020-03-01','2020-04-30'

    df_non_financials = fetch_adjusted_close(
        tickers=non_financial_tickers_2020,
        start=start_date,
        end=end_date,
        auto_adjust=False
    )

    # Round to three decimals
    df_non_financials = df_non_financials.round(3)

    print(
        f"\n Table 6: Fetched adjusted close prices for "
        f"{df_non_financials.shape[1]} tickers "
        f"from {start_date} to {end_date}\n"
    )
    print("-" * 120)
    print("First 5 rows:")
    print("-" * 120)
    print(df_non_financials.head().to_string())
    print("-" * 120)
    print("\nLast 5 rows:")
    print("-" * 120)
    print(df_non_financials.tail().to_string())
    print("-" * 120)
Warning: skipped tickers with no data: ['PEAK']

 Table 6: Fetched adjusted close prices for 14 tickers from 2020-03-01 to 2020-04-30

------------------------------------------------------------------------------------------------------------------------
First 5 rows:
------------------------------------------------------------------------------------------------------------------------
Price           KR     PFE     XOM     WMT     DAL    CSCO     EQIX     DUK    NFLX      GE     APA      F     REGN     CMS
Ticker          KR     PFE     XOM     WMT     DAL    CSCO     EQIX     DUK    NFLX      GE     APA      F     REGN     CMS
Date                                                                                                                       
2020-03-02  26.408  25.610  41.727  35.569  46.019  34.805  562.702  77.548  381.05  54.470  22.435  5.526  463.469  54.319
2020-03-03  26.106  25.185  39.729  34.657  45.063  33.850  559.319  76.698  368.77  52.866  21.969  5.350  460.278  54.472
2020-03-04  27.547  26.726  40.596  35.842  47.327  34.991  586.896  81.543  383.79  53.207  22.092  5.434  492.120  57.885
2020-03-05  29.780  26.036  38.807  35.581  43.921  33.453  560.692  80.324  372.78  48.979  21.398  5.173  486.824  57.360
2020-03-06  28.508  25.713  36.933  35.983  44.780  33.546  553.418  79.457  368.97  45.720  18.198  4.981  493.067  57.495
------------------------------------------------------------------------------------------------------------------------

Last 5 rows:
------------------------------------------------------------------------------------------------------------------------
Price           KR     PFE     XOM     WMT     DAL    CSCO     EQIX     DUK    NFLX      GE     APA      F     REGN     CMS
Ticker          KR     PFE     XOM     WMT     DAL    CSCO     EQIX     DUK    NFLX      GE     APA      F     REGN     CMS
Date                                                                                                                       
2020-04-23  29.255  26.939  33.650  39.626  21.936  35.519  623.265  68.283  426.70  31.712   9.258  3.753  563.961  50.170
2020-04-24  29.327  27.446  33.867  39.907  21.868  36.288  622.910  68.740  424.99  30.448   9.425  3.738  564.649  50.305
2020-04-27  29.611  28.144  34.029  39.555  21.624  36.748  638.276  69.590  421.38  31.275   9.258  3.968  546.011  49.120
2020-04-28  28.793  27.835  34.827  39.463  23.751  36.262  621.692  70.296  403.83  33.074   9.408  4.129  527.153  50.144
2020-04-29  28.045  27.989  36.755  38.106  26.659  36.987  617.655  69.494  411.89  32.004  11.506  4.037  514.727  48.493
------------------------------------------------------------------------------------------------------------------------

Step 10 c: Combined data structure and there returns¶

This script first fetches the two separate datasets for the financial and non-financial companies (as done in steps 10a and 10b). It then performs the tasks for Member B: merging the two series into a single DataFrame and computing the daily logarithmic returns for the new combined 2020 dataset.

In [ ]:
import pandas as pd
import yfinance as yf
import numpy as np
import logging

# Suppress yfinance informational messages for cleaner output
logging.getLogger("yfinance").setLevel(logging.CRITICAL)
logging.getLogger("yfinance").propagate = False

def fetch_adjusted_close(
    tickers: list,
    start: str,
    end: str,
    auto_adjust: bool = False
) -> pd.DataFrame:
    """
    Downloads historical price data ticker-by-ticker for robustness.
    """
    price_col = 'Close' if auto_adjust else 'Adj Close'
    frames, dropped = {}, []

    for t in tickers:
        try:
            df = yf.download(
                t, start=start, end=end,
                auto_adjust=auto_adjust,
                progress=False
            )[[price_col]].rename(columns={price_col: t})
            if df.empty:
                dropped.append(t)
            else:
                frames[t] = df
        except Exception:
            dropped.append(t)

    if not frames:
        raise ValueError("No tickers downloaded successfully.")

    df_price = pd.concat(frames.values(), axis=1)
    df_price = df_price.ffill().dropna(axis=0, how='all').dropna(axis=1, how='all')

    if dropped:
        print(f"Warning: skipped tickers with no data: {dropped}")

    return df_price

def compute_daily_log_returns(price_df):
    """
    Computes daily logarithmic returns for a DataFrame of prices.
    """
    return np.log(price_df / price_df.shift(1)).dropna(how='all')


if __name__ == "__main__":
    # =========================================================================
    # 1. FETCH BOTH DATASETS (FROM STEPS 10A & 10B)
    # =========================================================================
    financial_tickers_2020 = [
        'JPM', 'WFC', 'BAC', 'C', 'GS', 'USB', 'MS', 'KEY', 'PNC',
        'COF', 'AXP', 'PRU', 'SCHW', 'TFC'
    ]
    non_financial_tickers_2020 = [
        'KR', 'PFE', 'XOM', 'WMT', 'DAL', 'CSCO', 'PEAK', 'EQIX',
        'DUK', 'NFLX', 'GE', 'APA', 'F', 'REGN', 'CMS'
    ]
    start_date = '2020-03-01'
    end_date   = '2020-04-30'

    df_financials = fetch_adjusted_close(
        tickers=financial_tickers_2020,
        start=start_date,
        end=end_date
    )

    df_non_financials = fetch_adjusted_close(
        tickers=non_financial_tickers_2020,
        start=start_date,
        end=end_date
    )

    # =========================================================================
    # 2. MERGE THE SERIES INTO A SINGLE DATA STRUCTURE (STEP 10C)
    # =========================================================================
    """
    The two DataFrames are joined based on their shared 'Date' index. This
    combines them horizontally into a single, wide DataFrame.
    """
    df_combined_2020 = pd.concat([df_financials, df_non_financials], axis=1)

    # Sort columns alphabetically for a consistent order
    df_combined_2020 = df_combined_2020.sort_index(axis=1)

    print(
        f"Combined DataFrame has {df_combined_2020.shape[1]} total tickers."
    )

    # =========================================================================
    # 3. COMPUTE THE RETURNS (STEP 10C)
    # =========================================================================
    """
    The daily logarithmic returns are now calculated for the new, combined dataset,
    creating the final data structure needed for the next steps.
    """
    df_returns_2020 = compute_daily_log_returns(df_combined_2020)
    df_returns_2020 = df_returns_2020.round(4) # Round for clean display


    # =========================================================================
    # 4. DISPLAY THE FINAL RESULT
    # =========================================================================
    print(
        f"\n Table 7: Final Result: Daily Log Returns (March-April 2020)\n"
    )
    print("-" * 180)
    print("First 5 rows:")
    print("-" * 180)
    print(df_returns_2020.head().to_string())
    print("-" * 180)
    print("\nLast 5 rows:")
    print("-" * 180)
    print(df_returns_2020.tail().to_string())
    print("-" * 180)
Warning: skipped tickers with no data: ['PEAK']
Combined DataFrame has 28 total tickers.

 Table 7: Final Result: Daily Log Returns (March-April 2020)

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
First 5 rows:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Price          APA     AXP     BAC       C     CMS     COF    CSCO     DAL     DUK    EQIX       F      GE      GS     JPM     KEY      KR      MS    NFLX     PFE     PNC     PRU    REGN    SCHW     TFC     USB     WFC     WMT     XOM
Ticker         APA     AXP     BAC       C     CMS     COF    CSCO     DAL     DUK    EQIX       F      GE      GS     JPM     KEY      KR      MS    NFLX     PFE     PNC     PRU    REGN    SCHW     TFC     USB     WFC     WMT     XOM
Date                                                                                                                                                                                                                                      
2020-03-03 -0.0210 -0.0528 -0.0567 -0.0383  0.0028 -0.0568 -0.0278 -0.0210 -0.0110 -0.0060 -0.0325 -0.0299 -0.0293 -0.0382 -0.0458 -0.0115 -0.0458 -0.0328 -0.0168 -0.0554 -0.0582 -0.0069 -0.0917 -0.0401 -0.0437 -0.0418 -0.0260 -0.0491
2020-03-04  0.0056  0.0688  0.0228  0.0353  0.0608  0.0331  0.0332  0.0490  0.0613  0.0481  0.0157  0.0064  0.0258  0.0244  0.0292  0.0537  0.0187  0.0399  0.0594  0.0307  0.0277  0.0669 -0.0302  0.0104  0.0121  0.0212  0.0336  0.0216
2020-03-05 -0.0319 -0.0420 -0.0520 -0.0596 -0.0091 -0.0487 -0.0450 -0.0747 -0.0151 -0.0457 -0.0492 -0.0828 -0.0488 -0.0503 -0.0435  0.0780 -0.0604 -0.0291 -0.0262 -0.0697 -0.0657 -0.0108 -0.0656 -0.0829 -0.0637 -0.0623 -0.0073 -0.0451
2020-03-06 -0.1620 -0.0246 -0.0408 -0.0354  0.0024 -0.0333  0.0028  0.0194 -0.0108 -0.0131 -0.0378 -0.0689 -0.0303 -0.0531 -0.0720 -0.0437 -0.0178 -0.0103 -0.0125 -0.0550 -0.0345  0.0127 -0.0194 -0.0698 -0.0309 -0.0476  0.0112 -0.0495
2020-03-09 -0.7736 -0.0964 -0.1590 -0.1764 -0.0371 -0.1188 -0.0443 -0.0530 -0.0462 -0.0585 -0.0953 -0.1354 -0.1097 -0.1456 -0.2012 -0.0250 -0.1095 -0.0629 -0.0366 -0.1456 -0.1812 -0.0413 -0.1200 -0.1642 -0.1560 -0.1327 -0.0006 -0.1304
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Last 5 rows:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Price          APA     AXP     BAC       C     CMS     COF    CSCO     DAL     DUK    EQIX       F      GE      GS     JPM     KEY      KR      MS    NFLX     PFE     PNC     PRU    REGN    SCHW     TFC     USB     WFC     WMT     XOM
Ticker         APA     AXP     BAC       C     CMS     COF    CSCO     DAL     DUK    EQIX       F      GE      GS     JPM     KEY      KR      MS    NFLX     PFE     PNC     PRU    REGN    SCHW     TFC     USB     WFC     WMT     XOM
Date                                                                                                                                                                                                                                      
2020-04-23  0.1085 -0.0010  0.0032  0.0052 -0.0126  0.0209 -0.0034  0.0004 -0.0236  0.0010  0.0248  0.0139 -0.0053  0.0006  0.0384  0.0175 -0.0090  0.0125  0.0121  0.0053 -0.0069  0.0132 -0.0064  0.0085  0.0036 -0.0101 -0.0235  0.0309
2020-04-24  0.0179  0.0086  0.0141  0.0150  0.0027  0.0651  0.0214 -0.0031  0.0067 -0.0006 -0.0041 -0.0407  0.0111  0.0147  0.0251  0.0024  0.0133 -0.0040  0.0186  0.0130  0.0466  0.0012 -0.0068  0.0341  0.0187  0.0146  0.0071  0.0064
2020-04-27 -0.0179  0.0225  0.0565  0.0772 -0.0239  0.0527  0.0126 -0.0112  0.0123  0.0244  0.0598  0.0268  0.0363  0.0422  0.0631  0.0097  0.0347 -0.0085  0.0251  0.0462  0.0605 -0.0336  0.0399  0.0546  0.0538  0.0539 -0.0088  0.0048
2020-04-28  0.0161  0.0361  0.0177  0.0139  0.0206  0.0742 -0.0133  0.0938  0.0101 -0.0263  0.0398  0.0559  0.0187  0.0071  0.0188 -0.0280  0.0206 -0.0425 -0.0110  0.0047  0.0232 -0.0351  0.0065  0.0013  0.0212  0.0161 -0.0023  0.0232
2020-04-29  0.2014  0.0861  0.0366  0.0626 -0.0335  0.0899  0.0198  0.1155 -0.0115 -0.0065 -0.0226 -0.0329  0.0160  0.0266  0.0349 -0.0263  0.0251  0.0198  0.0055  0.0461  0.0614 -0.0239  0.0451  0.0438  0.0482  0.0384 -0.0350  0.0539
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 11: Rerunning the algorithms (UCB and ε-greedy) on the new data¶

In [ ]:
import pandas as pd
import yfinance as yf
import numpy as np
import matplotlib.pyplot as plt
import math
import random
import logging

np.random.seed(42)
# =============================================================================
# SETUP
# =============================================================================

# 1) Suppress yfinance INFO/ERROR logs
logging.getLogger("yfinance").setLevel(logging.CRITICAL)
logging.getLogger("yfinance").propagate = False

# 2) For reproducibility
np.random.seed(42)
random.seed(42)

# =============================================================================
# 1. DATA FETCH & PREPARATION
# =============================================================================

def fetch_adjusted_close(
    tickers: list,
    start: str,
    end: str,
    auto_adjust: bool = False
) -> pd.DataFrame:
    """
    Downloads historical adjusted-close series ticker-by-ticker.
    Skips any symbols that fail or return empty data.
    """
    price_col = 'Close' if auto_adjust else 'Adj Close'
    frames, dropped = {}, []

    for t in tickers:
        try:
            df = yf.download(
                t,
                start=start,
                end=end,
                auto_adjust=auto_adjust,
                progress=False
            )[[price_col]].rename(columns={price_col: t})

            if df.empty:
                dropped.append(t)
            else:
                frames[t] = df

        except Exception:
            dropped.append(t)

    if not frames:
        raise ValueError("No tickers downloaded successfully.")

    df_price = pd.concat(frames.values(), axis=1)
    df_price = df_price.ffill().dropna(how='all', axis=0).dropna(how='all', axis=1)

    if dropped:
        print(f"Warning: skipped tickers with no data: {dropped}")

    return df_price

def compute_daily_log_returns(price_df: pd.DataFrame) -> pd.DataFrame:
    """
    Computes daily log returns from a price DataFrame.
    """
    return np.log(price_df / price_df.shift(1)).dropna(how='all')

# =============================================================================
# 2. BANDIT ALGORITHMS
# =============================================================================

def run_ucb_simulation(returns_df: pd.DataFrame, c: float) -> list:
    """
    Upper-Confidence-Bound bandit on historical returns.
    c controls exploration strength.
    """
    arms = returns_df.columns.tolist()
    counts = {a: 0 for a in arms}
    sums   = {a: 0.0 for a in arms}
    cumulative, total = [], 0.0

    for t in range(len(returns_df)):
        if t < len(arms):
            choice = arms[t]
        else:
            ucb_scores = {}
            for a in arms:
                avg_reward = sums[a] / counts[a]
                bonus = c * math.sqrt(math.log(t) / counts[a])
                ucb_scores[a] = avg_reward + bonus
            choice = max(ucb_scores, key=ucb_scores.get)

        reward = returns_df.iloc[t][choice]
        counts[choice] += 1
        sums[choice]   += reward
        total += reward
        cumulative.append(total)

    return cumulative

def run_epsilon_greedy_simulation(returns_df: pd.DataFrame, epsilon: float) -> list:
    """
    Epsilon-greedy bandit on historical returns.
    epsilon fraction of days is pure exploration.
    """
    arms = returns_df.columns.tolist()
    counts = {a: 0 for a in arms}
    sums   = {a: 0.0 for a in arms}
    cumulative, total = [], 0.0

    for t in range(len(returns_df)):
        if random.random() < epsilon:
            choice = random.choice(arms)
        else:
            # explore any untried arm first
            untried = [a for a in arms if counts[a] == 0]
            if untried:
                choice = random.choice(untried)
            else:
                avg_rewards = {a: sums[a] / counts[a] for a in arms}
                choice = max(avg_rewards, key=avg_rewards.get)

        reward = returns_df.iloc[t][choice]
        counts[choice] += 1
        sums[choice]   += reward
        total += reward
        cumulative.append(total)

    return cumulative

def run_equal_weight_simulation(returns_df: pd.DataFrame) -> list:
    """
    Baseline: equal-weighted basket return cumulated over time.
    """
    return returns_df.mean(axis=1).cumsum().tolist()

# =============================================================================
# 3. MAIN EXECUTION & PLOTTING
# =============================================================================

if __name__ == "__main__":
    # --- 3A) Define tickers & dates ---
    financial_tickers = [
        'JPM','WFC','BAC','C','GS',
        'USB','MS','KEY','PNC','COF',
        'AXP','PRU','SCHW','TFC'
    ]
    non_financial_tickers = [
        'KR','PFE','XOM','WMT','DAL',
        'CSCO','PEAK','EQIX','DUK',
        'NFLX','GE','APA','F','REGN','CMS'
    ]
    start_date = '2020-03-01'
    end_date   = '2020-04-30'

    # --- 3B) Fetch prices & compute returns ---
    df_fin = fetch_adjusted_close(financial_tickers, start_date, end_date)
    df_non = fetch_adjusted_close(non_financial_tickers, start_date, end_date)
    df_all = pd.concat([df_fin, df_non], axis=1).sort_index(axis=1)
    df_ret = compute_daily_log_returns(df_all).round(4)

    # --- 3C) Run simulations ---
    results = {}
    # baseline
    results['Equal-Weight'] = run_equal_weight_simulation(df_ret)

    # UCB hyperparameters
    ucb_cs = [0.5, 2.0, 5.0]
    for c in ucb_cs:
        label = f'UCB (c={c})'
        results[label] = run_ucb_simulation(df_ret, c)

    # Epsilon-greedy hyperparameters
    epsilons = [0.05, 0.1, 0.2]
    for e in epsilons:
        label = f'E-Greedy (ε={e})'
        results[label] = run_epsilon_greedy_simulation(df_ret, e)

    # --- 3D) Plot comparison ---
    plt.figure(figsize=(12, 6))

    # Equal-weight baseline
    plt.plot(results['Equal-Weight'],
             label='Equal-Weight', color='black', linestyle='--', linewidth=2)

    # UCB family
    ucb_colors = ['#6baed6','#1f77b4','#08519c']
    for i, c in enumerate(ucb_cs):
        key = f'UCB (c={c})'
        plt.plot(results[key],
                 label=key,
                 color=ucb_colors[i],
                 linewidth=2.5)

    # Epsilon-greedy family
    eg_colors = ['#98df8a','#2ca02c','#006d2c']
    for i, e in enumerate(epsilons):
        key = f'E-Greedy (ε={e})'
        plt.plot(results[key],
                 label=key,
                 color=eg_colors[i],
                 linestyle=':',
                 linewidth=2)

    plt.title('Figure 6: Algorithm Performance on 2020 Data (COVID-19 Crash)', fontsize=16)
    plt.xlabel('Trading Days (Mar–Apr 2020)', fontsize=12)
    plt.ylabel('Cumulative Log Reward', fontsize=12)
    plt.grid(linestyle='--', alpha=0.5)
    plt.legend(title='Strategy', fontsize=10)
    plt.tight_layout()

    # Save & show
    plt.savefig('algorithm_comparison_2020.png')
    plt.show()
Warning: skipped tickers with no data: ['PEAK']
No description has been provided for this image

The simulation results (Figure 6) show dramatic underperformance of both bandit algorithms relative to the passive, Equal-Weight baseline portfolio.

  • Equal-Weight Portfolio: This simple 1/N diversification strategy, which failed decisively in the 2008 dataset, was the top performer in 2020. It captured the market's powerful and broad-based V-shaped recovery in late March and April, ending with a significant positive cumulative reward.

  • UCB Algorithm: All variations of the UCB algorithm performed extremely poorly, resulting in substantial losses. The strategy appears to have quickly locked onto poorly performing assets during the initial market crash and was unable to adapt to the sharp, uniform rebound. Notably, higher values for the exploration parameter c led to worse outcomes, suggesting that UCB’s "intelligent" exploration was detrimental, causing the agent to over-explore declining assets it was most uncertain about.

  • Epsilon-Greedy Algorithm: While also underperforming the baseline, the ε-Greedy strategies ended with smaller losses than their UCB counterparts. The variant with the highest random exploration (ε=0.2) performed best among the learning agents. This suggests that in a chaotic, rapidly reversing market, the "undirected" and random nature of ε-Greedy's exploration was more beneficial than UCB's focused exploration, as it allowed the agent to break away from persistently choosing losing stocks.

The stark contrast between the 2008 and 2020 results highlights the critical challenge of non-stationarity in financial markets. An algorithm that proves superior in one market regime may fail completely in another. Financial model performance is highly regime-dependent, with the statistical properties of returns shifting dramatically between stable and crisis periods (Ang and Timmermann, 4). The COVID-19 crisis, in particular, was unique. Unlike the more prolonged decline in 2008, the 2020 event was a "V-shaped" crash and rebound characterized by unprecedented speed and driven by massive, coordinated fiscal and monetary policy responses.

This difference in market dynamics explains the performance inversion. In the 2008 crisis, the decline was slower, allowing the UCB algorithm time to identify a few relatively outperforming assets to exploit. In 2020, the market crash was so sudden and the subsequent recovery so broad that a simple, passive strategy of holding all assets was optimal. The bandit algorithms, having just learned from a steep decline, likely continued to select assets that had performed well in the immediate past (i.e., defensive or less volatile stocks) and therefore missed the aggressive, market-wide rebound. This aligns with recent findings that complex, adaptive models can be fragile and that simpler strategies often prove more robust during periods of extreme, unprecedented market stress (Leal et al., 21). The failure of the bandit models in 2020 serves as a powerful illustration of this principle.

Conclusion¶

The effectiveness of multi-armed bandit algorithms in portfolio selection is highly dependent on the market regime. In the prolonged 2008 financial crisis, the UCB algorithm's intelligent exploration strategy significantly outperformed both the naive Equal-Weight portfolio and the less efficient Epsilon-Greedy algorithm, proving its ability to identify and exploit outperforming assets in a trending downturn. Conversely, during the rapid V-shaped crash and recovery of the COVID-19 pandemic in 2020, both bandit algorithms failed dramatically, with the simple Equal-Weight strategy yielding the best results by capturing the broad and swift market rebound. This stark contrast underscores a critical challenge in financial machine learning: adaptive strategies optimized for one set of market dynamics can be brittle and counterproductive in another, highlighting the superior robustness of simpler, passive strategies during periods of unprecedented, rapid market reversals.

References¶

Ang, Andrew, and Allan Timmermann. "Regime Changes and Financial Markets." Annual Review of Financial Economics, vol. 4, no. 1, 2012, pp. 313-337.

Auer, Peter, et al. "Finite-time Analysis of the Multiarmed Bandit Problem." Machine Learning, vol. 47, no. 2/3, 2002, pp. 235-56.

Huo, Xiaoguang, and Feng Fu. "Risk-aware multi-armed bandit problem with application to portfolio selection." Royal Society Open Science, vol. 4, no. 11, 2017, pp. 1-16.

Kolm, Petter N., and Gordon Ritter. "Dynamic Replication and Hedging: A Reinforcement Learning Approach." The Journal of Financial Data Science, vol. 1, no. 3, 2019, pp. 93-113.

Leal, Ricardo P. C., et al. "Portfolio Selection in the Times of COVID-19." Revista Brasileira de Finanças, vol. 19, no. 3, 2021, pp. 1-25.

Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. 2nd ed., The MIT Press, 2018.