Data Analytics for Finance

BM17FI · Rotterdam School of Management


ASSIGNMENT 06

Replication Exercise

Learning Objectives¶

This assignment prepares you for MSc thesis Module 2/3 (replication and extension) by having you:

  1. Read and understand methodology from a published empirical finance paper
  2. Build an analysis panel from multiple raw data sources (merge, reshape)
  3. Calculate returns following published methodology (compound, standardized)
  4. Create derived variables (terciles, interactions)
  5. Work with panel data at geographic (municipality) level
  6. Estimate fixed effects regressions with appropriate standard errors
  7. Interpret results and discuss limitations honestly
  8. Understand the gap between "ideal" identification and feasible tests
📝 Assignment Tasks
In this assignment, you will:
  1. Load and examine three raw datasets (crime, income, stock returns)
  2. Calculate monthly stock returns following published methodology
  3. Reshape crime data from long to wide format
  4. Merge datasets to create a municipality-month panel
  5. Create analysis variables (income terciles, interactions)
  6. Estimate panel regressions testing differential effects
  7. Interpret findings and discuss limitations
  8. Reflect on thesis Module 2/3 connection

Setup¶

Environment Setup¶

We begin by clearing the environment and preparing Stata for analysis.
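The setup cell itself is hidden in this export; a minimal sketch of what such a cell typically contains (the scheme name matches the confirmation printed below):

```stata
* Minimal environment reset (sketch; the actual setup cell is not shown)
clear all           // remove data, matrices, and programs from memory
set more off        // print long output without pausing
set scheme stcolor  // graph scheme used for all figures
```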

✅ Graph scheme set to stcolor.
✅ The environment is cleared and ready.

Set File Paths¶

We define global macros for all file paths to keep code clean and portable.
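A sketch of how such globals can be defined — the base path below is a placeholder; substitute your own project directory:

```stata
* Path globals (sketch; replace the base path with your own)
global base      "~/assignments/06-assignment"
global raw       "$base/data/raw"
global processed "$base/data/processed"
global output    "$base/output"
global tables    "$output/tables"
global figures   "$output/figures"
```

Defining paths once at the top means every later `use` and `save` stays portable: a collaborator only has to change `$base`.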

/Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignment
> s/06-assignment
📁 Base directory: /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/06-assignment
📁 Raw data folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-privat
> e/private/assignments/06-assignment/data/raw
📁 Processed data folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-
> private/private/assignments/06-assignment/data/processed
📁 Output directory: /Users/casparm4/Github/rsm-data-analytics-in-finance-priva
> te/private/assignments/06-assignment/output
📁 Tables folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/
> private/assignments/06-assignment/output/tables
📁 Figures folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/06-assignment/output/figures

Part A: Data Preparation¶

Section 1: Introduction and Paper Context¶

Research Question¶

This assignment is based on Huck (2024) "The Psychological Externalities of Investing: Evidence from Stock Returns and Crime" published in The Review of Financial Studies.

Key finding from Huck (2024):

  • When stock markets rise, violent crime decreases in high-income areas (where people are more likely to be investors)
  • When stock markets rise, violent crime increases in low-income areas (where people are less likely to be investors)
  • This suggests that relative wealth/status affects psychological well-being and behavior

What We're Testing¶

We test whether this differential relationship holds in the Netherlands using:

  • Monthly crime data from CBS/Politie (~350 municipalities, 2012-2019)
  • Monthly AEX index returns
  • Cross-sectional income data by municipality

Why This Assignment Matters for Your Thesis¶

Published research often starts with an observation in one setting. A valuable contribution is testing whether findings generalize to new contexts. Even null results are informative.

In your thesis, you will spend 60-80% of your time preparing data. This assignment gives you practice with the exact operations you'll need:

  • Merging datasets from different sources
  • Calculating returns properly (following published methodology)
  • Reshaping data
  • Creating analysis variables

Assignment Roadmap¶

Part A: Data Preparation

  1. Load raw data (crime, income, AEX prices)
  2. Calculate monthly returns (compound and standardized)
  3. Reshape crime data (long → wide)
  4. Merge datasets into a municipality-month panel
  5. Create analysis variables (terciles, interactions)
ℹ️ Why Return Calculation Matters
Huck (2024) uses standardized returns: daily returns divided by trailing 252-day standard deviation. This follows Engelberg & Parsons (2016) and Edmans, Garcia & Norli (2007).

Standardization matters because a 2% return during low volatility is more "newsworthy" than 2% during a crisis. A standardized return of 2 means the market moved 2 standard deviations—regardless of the volatility regime.

For monthly analysis, we aggregate standardized daily returns. This preserves the "surprise" interpretation while matching our crime data frequency.

⚠️ Important Limitation
The original paper uses daily data to identify contemporaneous same-day effects. Our monthly data cannot test this timing mechanism.

We are testing a weaker hypothesis: whether monthly returns correlate with monthly crime differently across income groups.

This is a common constraint in replication work—data availability often limits what can be tested. Your thesis will face similar challenges.


Section 2: Load and Examine Raw Data¶

We have three raw datasets to work with:

  1. crime.dta: Monthly crime counts by municipality (long format by crime type)
  2. income_data.dta: Average median income by municipality (cross-section)
  3. aex_daily_data.dta: Daily AEX index prices (we'll compute returns)

Let's load and examine each dataset to understand its structure.

Task 2.1: Load and Examine Crime Data¶

📝 Task
Load the crime dataset from $raw/crime.dta and examine its structure. The data should contain monthly crime counts for each municipality and crime type. Use describe to see variables, list to see the first few observations, and summarize to get basic statistics.
💡 Stata Tip
Consider using the following commands:
  • use "$raw/crime.dta", clear — to load the dataset
  • describe — to see variable names and types
  • list in 1/10 — to view first 10 rows
  • tab crime_type — to see what crime types are included
  • summarize reported_crimes — to get basic statistics
Contains data from /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/06-assignment/data/raw/crime.dta
 Observations:       172,344                  
    Variables:             5                  06 Jan 2026 22:55
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
crime_type      str21   %-9s                  Type of Crime
municipality~de str6    %-9s                  Municipality Code
municipality~me str29   %-9s                  Municipality Name
reported_crimes long    %12.0g                
date            double  %td                   Date (Year-Month)
-------------------------------------------------------------------------------
Sorted by: 

     +-------------------------------------------------------------------+
     | crime_type          munic~de   municipa~me   report~s        date |
     |-------------------------------------------------------------------|
  1. | Totaal misdrijven   GM1680     Aa en Hunze         84   01jan2012 |
  2. | Totaal misdrijven   GM1680     Aa en Hunze        103   01feb2012 |
  3. | Totaal misdrijven   GM1680     Aa en Hunze        101   01mar2012 |
  4. | Totaal misdrijven   GM1680     Aa en Hunze         71   01apr2012 |
  5. | Totaal misdrijven   GM1680     Aa en Hunze         78   01may2012 |
     |-------------------------------------------------------------------|
  6. | Totaal misdrijven   GM1680     Aa en Hunze         50   01jun2012 |
  7. | Totaal misdrijven   GM1680     Aa en Hunze         69   01jul2012 |
  8. | Totaal misdrijven   GM1680     Aa en Hunze         99   01aug2012 |
  9. | Totaal misdrijven   GM1680     Aa en Hunze         73   01sep2012 |
 10. | Totaal misdrijven   GM1680     Aa en Hunze         77   01oct2012 |
     +-------------------------------------------------------------------+

        Type of Crime |      Freq.     Percent        Cum.
----------------------+-----------------------------------
1.4.2 Moord, doodslag |     57,448       33.33       33.33
   1.4.5 Mishandeling |     57,448       33.33       66.67
    Totaal misdrijven |     57,448       33.33      100.00
----------------------+-----------------------------------
                Total |    172,344      100.00

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
reported_c~s |    126,324    101.4997     377.645          1       9239
---- CHECKPOINT: Crime data loaded ----
Observations: 172344
Variables: 5
✅ Task 2.1 tests passed

Task 2.2: Load and Examine Income Data¶

📝 Task
Load the income dataset from $raw/income_data.dta and examine its structure. This is a cross-sectional dataset with one observation per municipality containing average median standardized income for 2011-2018.
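A sketch of the loading step; the income variable's full name is truncated in the `describe` output below, so a wildcard is used rather than guessing it:

```stata
* Load and inspect the income cross-section (sketch)
use "$raw/income_data.dta", clear
describe
list in 1/10
summarize avg_*   // wildcard: the full variable name is truncated in describe
```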
Contains data from /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/06-assignment/data/raw/income_data.dta
 Observations:           355                  
    Variables:             2                  06 Jan 2026 22:55
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
municipality_~e str6    %-9s                  Municipality Code
avg_median_st~e double  %10.0g                Average Median Standardized
                                                Income
-------------------------------------------------------------------------------
Sorted by: 

     +----------------------+
     | munici~e   avg_med~e |
     |----------------------|
  1. | GM0003     20.606186 |
  2. | GM0010     21.084211 |
  3. | GM0014     19.855506 |
  4. | GM0024     22.255462 |
  5. | GM0034     22.317833 |
     |----------------------|
  6. | GM0037     20.437981 |
  7. | GM0047      20.88774 |
  8. | GM0050     23.638929 |
  9. | GM0059     20.745074 |
 10. | GM0060     23.382847 |
     +----------------------+

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
avg_median~e |        355    23.53126    1.867351   19.83227   36.45822
---- CHECKPOINT: Income data loaded ----
Municipalities: 355
✅ Task 2.2 tests passed

Task 2.3: Load and Examine AEX Daily Price Data¶

📝 Task
Load the daily AEX index data from $raw/aex_daily_data.dta and examine its structure. This dataset contains daily closing prices for the AEX index. We will calculate returns from these prices in the next section.
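A sketch of the loading step; `summarize date, format` is a quick way to see the sample's start and end as readable dates:

```stata
* Load and inspect daily AEX prices (sketch)
use "$raw/aex_daily_data.dta", clear
describe
list in 1/5
summarize price
summarize date, format   // the format option prints min/max as readable dates
```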
Contains data from /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/06-assignment/data/raw/aex_daily_data.dta
 Observations:         2,609                  
    Variables:             2                  06 Jan 2026 22:55
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
date            double  %td                   Date
price           double  %10.0g                AEX Index Closing Price
-------------------------------------------------------------------------------
Sorted by: 

     +----------------------+
     |      date      price |
     |----------------------|
  1. | 03jan2011   359.8629 |
  2. | 04jan2011   358.8617 |
  3. | 05jan2011   357.2818 |
  4. | 06jan2011   356.8858 |
  5. | 07jan2011   356.4425 |
     |----------------------|
  6. | 10jan2011   354.2092 |
  7. | 11jan2011    358.303 |
  8. | 12jan2011   362.4137 |
  9. | 13jan2011   360.7873 |
 10. | 14jan2011   361.3202 |
     +----------------------+

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |      2,609    451.2762    92.43622   263.4371   629.2317

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        date |      2,609     20454.8    1054.616      18630      22280
---- CHECKPOINT: AEX data loaded ----
Trading days: 2609
✅ Task 2.3 tests passed

Section 3: Calculate AEX Returns (Following Huck 2024 Methodology)¶

Following Huck (2024), we need to calculate standardized returns. This involves:

  1. Calculate daily returns
  2. Calculate trailing 252-day standard deviation (one trading year)
  3. Standardize daily returns by dividing by trailing SD
  4. Aggregate to monthly frequency

We'll also calculate compound monthly returns for comparison.

Task 3.1: Calculate Daily Returns¶

📝 Task
Calculate daily returns from the AEX price data. Create both:
  • Simple returns: ret_daily = (price_t / price_(t-1)) - 1
  • Log returns: ret_log = ln(price_t / price_(t-1))
Make sure the data is sorted by date first.
💡 Stata Tip
For time series operations:
  • sort date — sort data chronologically
  • gen ret_daily = (price / price[_n-1]) - 1 — simple return
  • gen ret_log = ln(price / price[_n-1]) — log return
  • summarize ret_daily, detail — check for reasonable range
(1 missing value generated)
(1 missing value generated)

                     Daily simple return
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -.0323815      -.1075333
 5%    -.0172826      -.0764755
10%    -.0116792        -.05704       Obs               2,608
25%    -.0047384       -.052363       Sum of wgt.       2,608

50%     .0004368                      Mean           .0002727
                        Largest       Std. dev.      .0110496
75%     .0058393       .0421586
90%     .0118066       .0452309       Variance       .0001221
95%     .0171069       .0461471       Skewness      -.5256861
99%     .0295844       .0897213       Kurtosis       10.82545
Non-missing returns: 2608
Mean daily return: 0.0003
Std dev: 0.0110
✅ Task 3.1 tests passed

Task 3.2: Calculate Trailing 252-Day Standard Deviation¶

📝 Task
Calculate the trailing 252-day (approximately one trading year) standard deviation of daily returns. This will be used to standardize returns. Use the rangestat package (install it once with ssc install rangestat) to calculate rolling statistics efficiently.
💡 Stata Tip
The rangestat command calculates statistics over a moving window:
  • rangestat (sd) trailing_sd = ret_daily, interval(date -252 -1)
  • The interval -252 -1 means "from 252 days ago to 1 day ago" (excludes today)
  • Early observations have missing trailing_sd until at least two prior returns fall in the window; rangestat uses however many observations the window contains, so only the first few days are missing
               Trailing 252-day SD of returns
-------------------------------------------------------------
      Percentiles      Smallest
 1%     .0052468       .0011458
 5%     .0058144       .0015453
10%     .0067857       .0016472       Obs               2,606
25%     .0075677       .0021919       Sum of wgt.       2,606

50%     .0084327                      Mean           .0102832
                        Largest       Std. dev.      .0038261
75%     .0130778       .0203429
90%     .0162531       .0203489       Variance       .0000146
95%     .0187518       .0203502       Skewness       .9560908
99%      .020195       .0203506       Kurtosis       2.984837
  3
Missing SD values (first ~252 days): 3
Mean trailing SD: 0.0103
✅ Task 3.2 tests passed

Task 3.3: Calculate Standardized Daily Returns¶

📝 Task
Create standardized daily returns by dividing each day's return by its trailing standard deviation. This follows the Huck (2024) methodology and creates a measure of "surprise" that accounts for the volatility regime.

Name the new variable ret_std_daily.
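With trailing_sd from Task 3.2 in memory, the standardization is a single division — a minimal sketch:

```stata
* Standardize daily returns by the trailing 252-day volatility (sketch)
gen ret_std_daily = ret_daily / trailing_sd
label variable ret_std_daily "Standardized daily return (ret/SD)"
summarize ret_std_daily, detail
```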

(3 missing values generated)

             Standardized daily return (ret/SD)
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -3.347214      -9.463235
 5%    -1.738406      -7.754665
10%    -1.204699      -5.648883       Obs               2,606
25%    -.5011903       -4.94842       Sum of wgt.       2,606

50%      .039744                      Mean           .0189555
                        Largest       Std. dev.      1.077812
75%     .6185057       3.729102
90%     1.172071       4.049637       Variance       1.161679
95%     1.639769        5.27277       Skewness      -.7527249
99%      2.53291       5.910226       Kurtosis       8.611722
Non-missing standardized returns: 2606
Mean: 0.0190
Std dev: 1.0778
✅ Task 3.3 tests passed

Task 3.4: Create Year-Month Variable¶

📝 Task
Create a year-month variable that we'll use to aggregate daily data to monthly frequency. Use Stata's mofd() function to convert daily dates to monthly format.
💡 Stata Tip
To work with monthly dates:
  • gen year_month = mofd(date) — convert daily date to monthly
  • format year_month %tm — format as YYYY-MM
✅ Task 3.4 tests passed

Task 3.5: Aggregate to Monthly Frequency¶

📝 Task
Aggregate daily data to monthly frequency. Calculate:
  • Compound return: exp(sum of log returns) - 1
  • Standardized monthly return: sum of standardized daily returns
  • Volatility: standard deviation of daily returns
  • Trading days: count of observations per month
💡 Stata Tip
First create monthly aggregates within groups, then collapse:
  • bysort year_month: egen ret_log_sum = total(ret_log)
  • gen ret_compound = exp(ret_log_sum) - 1
  • bysort year_month: egen ret_std_monthly = total(ret_std_daily)
  • collapse (first) ret_compound ret_std_monthly (sd) volatility = ret_daily (count) n_trading_days = ret_daily, by(year_month)
  • gen date = dofm(year_month) — create a daily date variable for later merging with crime data
  • format date %td — apply date format
     +--------------------------------------------------------------------+
     | year_m~h   ret_com~d   ret_std~y   volati~y   n_trad~s        date |
     |--------------------------------------------------------------------|
  1. |   2011m1    .0024548    .5534533    .007666         20   01jan2011 |
  2. |   2011m2    .0232263    2.948537   .0068946         20   01feb2011 |
  3. |   2011m3   -.0094981   -2.251385   .0097187         23   01mar2011 |
  4. |   2011m4   -.0155369   -1.979866   .0076843         21   01apr2011 |
  5. |   2011m5   -.0291778   -3.587304   .0083145         22   01may2011 |
     |--------------------------------------------------------------------|
  6. |   2011m6    -.028016   -3.423634   .0093389         22   01jun2011 |
  7. |   2011m7   -.0308594   -3.683877    .010141         21   01jul2011 |
  8. |   2011m8   -.1100779   -13.34819   .0223073         23   01aug2011 |
  9. |   2011m9   -.0435389   -3.706579    .023562         22   01sep2011 |
 10. |  2011m10    .0975143    6.884334   .0174383         21   01oct2011 |
     |--------------------------------------------------------------------|
 11. |  2011m11   -.0254433   -1.531934   .0208009         22   01nov2011 |
 12. |  2011m12    .0427059    2.665466   .0124832         22   01dec2011 |
     +--------------------------------------------------------------------+

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
  year_month |        120       671.5    34.78505        612        731
ret_compound |        120    .0054344    .0408573  -.1100779   .1351352
ret_std_mo~y |        120    .4116507    4.289544  -13.34819    8.55268
  volatility |        120    .0098128    .0052895    .003573   .0422672
n_trading_~s |        120    21.73333    .9850422         20         23
-------------+---------------------------------------------------------
        date |        120    20438.45    1058.793      18628      22250
---- CHECKPOINT: Monthly aggregation complete ----
Months: 120
Mean monthly return: 0.0054
SD of monthly returns: 0.0409
✅ Task 3.5 tests passed

Task 3.6: Save Monthly Returns Data¶

📝 Task
Save the monthly returns data to $processed/aex_monthly.dta. This will be merged with the crime data later.
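A sketch of the save step; the replace option lets the do-file be rerun without a "file already exists" error:

```stata
* Save monthly returns for the later merge (sketch)
save "$processed/aex_monthly.dta", replace
```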
file
    /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assi
    > gnments/06-assignment/data/processed/aex_monthly.dta saved
✅ Task 3.6 tests passed

Section 4: Reshape Crime Data (Long to Wide)¶

The crime data is currently in long format: one row per municipality × month × crime_type combination.

We need wide format: one row per municipality × month, with separate columns for each crime type.

Why? For regression analysis, we need a single outcome variable per observation.

⚠️ Important: Reshape Syntax
Stata's reshape command requires:
  1. Unique i() identifier — here: municipality_code date
  2. j() variable with values to become column suffixes — here: crime_id (numeric)
  3. Variables to reshape — here: reported_crimes
Common errors:
  • Duplicates in i() j() combinations → check with: duplicates report municipality_code date crime_id
  • String j() variable → reshape needs the string option for string j(); we create a numeric crime_id instead

Task 4.1: Examine Current Structure¶

📝 Task
Load the crime data and examine its current long format structure. Use tab crime_type to see what crime types are included.
Contains data from /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/06-assignment/data/raw/crime.dta
 Observations:       172,344                  
    Variables:             5                  06 Jan 2026 22:55
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
crime_type      str21   %-9s                  Type of Crime
municipality~de str6    %-9s                  Municipality Code
municipality~me str29   %-9s                  Municipality Name
reported_crimes long    %12.0g                
date            double  %td                   Date (Year-Month)
-------------------------------------------------------------------------------
Sorted by: 

        Type of Crime |      Freq.     Percent        Cum.
----------------------+-----------------------------------
1.4.2 Moord, doodslag |     57,448       33.33       33.33
   1.4.5 Mishandeling |     57,448       33.33       66.67
    Totaal misdrijven |     57,448       33.33      100.00
----------------------+-----------------------------------
                Total |    172,344      100.00

     +--------------------------------------------------------+
     | municipa~me        date   crime_type          report~s |
     |--------------------------------------------------------|
  1. | Aa en Hunze   01jan2012   Totaal misdrijven         84 |
  2. | Aa en Hunze   01feb2012   Totaal misdrijven        103 |
  3. | Aa en Hunze   01mar2012   Totaal misdrijven        101 |
  4. | Aa en Hunze   01apr2012   Totaal misdrijven         71 |
  5. | Aa en Hunze   01may2012   Totaal misdrijven         78 |
     |--------------------------------------------------------|
  6. | Aa en Hunze   01jun2012   Totaal misdrijven         50 |
  7. | Aa en Hunze   01jul2012   Totaal misdrijven         69 |
  8. | Aa en Hunze   01aug2012   Totaal misdrijven         99 |
  9. | Aa en Hunze   01sep2012   Totaal misdrijven         73 |
 10. | Aa en Hunze   01oct2012   Totaal misdrijven         77 |
     |--------------------------------------------------------|
 11. | Aa en Hunze   01nov2012   Totaal misdrijven         67 |
 12. | Aa en Hunze   01dec2012   Totaal misdrijven         53 |
 13. | Aa en Hunze   01jan2013   Totaal misdrijven         76 |
 14. | Aa en Hunze   01feb2013   Totaal misdrijven         65 |
 15. | Aa en Hunze   01mar2013   Totaal misdrijven         56 |
     +--------------------------------------------------------+
---- CHECKPOINT: Crime data structure examined ----
Total observations (municipality × month × crime_type): 172344
Crime types: 3
✅ Task 4.1 tests passed

Task 4.2: Create Numeric Crime Type Identifier¶

📝 Task
Create a numeric identifier for crime types, since reshape works most naturally with a numeric j() variable:
  • 1 = "Totaal misdrijven" (Total crimes)
  • 2 = "1.4.5 Mishandeling" (Assault)
  • 3 = "1.4.2 Moord, doodslag" (Murder/manslaughter)

Name the new variable crime_id.
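One way to build the mapping — a sketch whose value labels match the cross-tabulation shown below:

```stata
* Map crime types to a numeric id for reshape's j() option (sketch)
gen crime_id = .
replace crime_id = 1 if crime_type == "Totaal misdrijven"
replace crime_id = 2 if crime_type == "1.4.5 Mishandeling"
replace crime_id = 3 if crime_type == "1.4.2 Moord, doodslag"
label define crime_lbl 1 "total" 2 "assault" 3 "murder"
label values crime_id crime_lbl
tab crime_id crime_type   // verify the mapping is one-to-one
```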

(172,344 missing values generated)
(57,448 real changes made)
(57,448 real changes made)
(57,448 real changes made)

           |          Type of Crime
  crime_id | 1.4.2 M..  1.4.5 M..  Totaal .. |     Total
-----------+---------------------------------+----------
     total |         0          0     57,448 |    57,448 
   assault |         0     57,448          0 |    57,448 
    murder |    57,448          0          0 |    57,448 
-----------+---------------------------------+----------
     Total |    57,448     57,448     57,448 |   172,344 
✅ Task 4.2 tests passed

Task 4.3: Reshape Wide¶

📝 Task
Reshape the data from long to wide format, creating separate columns for each crime type. After reshaping, rename the variables to meaningful names (total_crimes, assault_crimes, murder_crimes).
💡 Stata Tip
Reshaping steps:
  • keep municipality_code municipality_name date crime_id reported_crimes — keep only needed variables
  • reshape wide reported_crimes, i(municipality_code date) j(crime_id)
  • rename reported_crimes1 total_crimes — rename to meaningful names
Duplicates in terms of municipality_code date crime_id

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |       172344             0
--------------------------------------
(j = 1 2 3)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations          172,344   ->   57,448      
Number of variables                   5   ->   6           
j variable (3 values)          crime_id   ->   (dropped)
xij variables:
                        reported_crimes   ->   reported_crimes1 reported_crimes
> 2 reported_crimes3
-----------------------------------------------------------------------------

Contains data
 Observations:        57,448                  
    Variables:             6                  
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
municipality~de str6    %-9s                  Municipality Code
date            double  %td                   Date (Year-Month)
total_crimes    long    %12.0g                Total crimes reported
assault_crimes  long    %12.0g                Assault crimes reported
murder_crimes   long    %12.0g                Murder/manslaughter crimes
                                                reported
municipality~me str29   %-9s                  Municipality Name
-------------------------------------------------------------------------------
Sorted by: municipality_code  date
     Note: Dataset has changed since last saved.

     +-------------------------------------------------------+
  1. | munic~de |      date | total_~s | assaul~s | murder~s |
     | GM0014   | 01jan2012 |     1562 |      111 |        5 |
     |-------------------------------------------------------|
     |                 municipality_name                     |
     |                 Groningen (gemeente)                  |
     +-------------------------------------------------------+

     +-------------------------------------------------------+
  2. | munic~de |      date | total_~s | assaul~s | murder~s |
     | GM0014   | 01feb2012 |     1373 |       95 |        4 |
     |-------------------------------------------------------|
     |                 municipality_name                     |
     |                 Groningen (gemeente)                  |
     +-------------------------------------------------------+

     +-------------------------------------------------------+
  3. | munic~de |      date | total_~s | assaul~s | murder~s |
     | GM0014   | 01mar2012 |     1663 |      105 |        8 |
     |-------------------------------------------------------|
     |                 municipality_name                     |
     |                 Groningen (gemeente)                  |
     +-------------------------------------------------------+

     +-------------------------------------------------------+
  4. | munic~de |      date | total_~s | assaul~s | murder~s |
     | GM0014   | 01apr2012 |     1520 |       91 |        3 |
     |-------------------------------------------------------|
     |                 municipality_name                     |
     |                 Groningen (gemeente)                  |
     +-------------------------------------------------------+

     +-------------------------------------------------------+
  5. | munic~de |      date | total_~s | assaul~s | murder~s |
     | GM0014   | 01may2012 |     1728 |       90 |        3 |
     |-------------------------------------------------------|
     |                 municipality_name                     |
     |                 Groningen (gemeente)                  |
     +-------------------------------------------------------+

     +-------------------------------------------------------+
  6. | munic~de |      date | total_~s | assaul~s | murder~s |
     | GM0014   | 01jun2012 |     1547 |       98 |        8 |
     |-------------------------------------------------------|
     |                 municipality_name                     |
     |                 Groningen (gemeente)                  |
     +-------------------------------------------------------+

     +-------------------------------------------------------+
  7. | munic~de |      date | total_~s | assaul~s | murder~s |
     | GM0014   | 01jul2012 |     1598 |       86 |        4 |
     |-------------------------------------------------------|
     |                 municipality_name                     |
     |                 Groningen (gemeente)                  |
     +-------------------------------------------------------+

     +-------------------------------------------------------+
  8. | munic~de |      date | total_~s | assaul~s | murder~s |
     | GM0014   | 01aug2012 |     1450 |       83 |        5 |
     |-------------------------------------------------------|
     |                 municipality_name                     |
     |                 Groningen (gemeente)                  |
     +-------------------------------------------------------+

     +-------------------------------------------------------+
  9. | munic~de |      date | total_~s | assaul~s | murder~s |
     | GM0014   | 01sep2012 |     1742 |      100 |        6 |
     |-------------------------------------------------------|
     |                 municipality_name                     |
     |                 Groningen (gemeente)                  |
     +-------------------------------------------------------+

     +-------------------------------------------------------+
 10. | munic~de |      date | total_~s | assaul~s | murder~s |
     | GM0014   | 01oct2012 |     1703 |      108 |        . |
     |-------------------------------------------------------|
     |                 municipality_name                     |
     |                 Groningen (gemeente)                  |
     +-------------------------------------------------------+
---- CHECKPOINT: Data reshaped wide ----
Observations (municipality × month): 57448

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
total_crimes |     57,243    212.5807    539.8091          1       9239
assault_cr~s |     51,940    11.79105    28.66676          1        463
murder_cri~s |     17,141    2.372032    3.827666          1         48
✅ Task 4.3 tests passed

Task 4.4: Handle Missing Values¶

📝 Task
Check for and handle missing values in the crime count variables. Missing crime counts should be treated as zero (no crimes reported that month).
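A minimal sketch of one way to do this (the counting and display steps are an assumed implementation; variable names are taken from the output below):

```stata
* Replace missing crime counts with zero, reporting how many values change
foreach var of varlist total_crimes assault_crimes murder_crimes {
    count if missing(`var')
    display "Replacing " r(N) " missing values in `var' with 0"
    replace `var' = 0 if missing(`var')
}
```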
  205
Replacing 205 missing values in total_crimes with 0
(205 real changes made)
  5,508
Replacing 5508 missing values in assault_crimes with 0
(5,508 real changes made)
  40,307
Replacing 40307 missing values in murder_crimes with 0
(40,307 real changes made)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
total_crimes |     57,448    211.8221    538.9942          0       9239
assault_cr~s |     57,448    10.66055    27.47804          0        463
murder_cri~s |     57,448    .7077531     2.35568          0         48
  0
Remaining missing values: 0
✅ Task 4.4 tests passed

Task 4.5: Save Wide Crime Data¶

📝 Task
Save the reshaped wide crime data to $processed/crime_wide.dta.
file
    /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assi
    > gnments/06-assignment/data/processed/crime_wide.dta saved
✅ Wide crime data saved to: /Users/casparm4/Github/rsm-data-analytics-in-financ
> e-private/private/assignments/06-assignment/data/processed/crime_wide.dta
Observations: 57448
✅ Task 4.5 tests passed

Section 5: Merge Datasets¶

Now we merge our three datasets:

  1. Crime data (wide format): municipality × month panel
  2. Income data: municipality-level cross-section
  3. AEX returns: monthly time series

The goal is to create a municipality-month panel with all variables.

Task 5.1: Start with Crime Panel¶

📝 Task
Load the wide crime data from $processed/crime_wide.dta to use as the base dataset for merging. This will be our "master" dataset since it defines the municipality-month panel structure.
Observations: 57448
Variables: 6
Time period: 01jan2012 to 01nov2025
✅ Task 5.1 tests passed

Task 5.2: Merge Income Data¶

📝 Task
Merge the income data using municipality_code. This is a many-to-one merge (m:1) because:
  • Crime data has many observations per municipality (one per month)
  • Income data has one observation per municipality (cross-section)
Keep only municipalities that appear in both datasets (_merge == 3).
Stata Tip: Merge Types
  • merge m:1 municipality_code using "$raw/income_data.dta"
  • tab _merge — check merge results
  • keep if _merge == 3 — keep only matched
  • drop _merge — remove merge indicator
    Result                      Number of obs
    -----------------------------------------
    Not matched                         1,187
        from master                     1,169  (_merge==1)
        from using                         18  (_merge==2)

    Matched                            56,279  (_merge==3)
    -----------------------------------------

   Matching result from |
                  merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
        Master only (1) |      1,169        2.03        2.03
         Using only (2) |         18        0.03        2.07
            Matched (3) |     56,279       97.93      100.00
------------------------+-----------------------------------
                  Total |     57,466      100.00
(1,187 observations deleted)
Observations after merge: 56279

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
avg_median~e |     56,279    23.50222    1.863046   19.83227   36.45822
✅ Task 5.2 tests passed

Task 5.3: Create Year-Month Variable for Merging¶

📝 Task
Create a year_month variable in the crime dataset to match the one in the AEX returns dataset. This will be the merge key for the time series merge.
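One way to construct the monthly key from the daily date (a sketch, assuming `date` holds a daily Stata date, as the `%td` format elsewhere in the dataset suggests):

```stata
* Convert the daily date to a monthly date for merging with the AEX series
gen year_month = mofd(date)
format year_month %tm
```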
---- CHECKPOINT: Year-month variable created ----
Unique months: 167

     +---------------------------------+
     | munic~de        date   year_m~h |
     |---------------------------------|
  1. | GM0014     01jan2012     2012m1 |
  2. | GM0014     01feb2012     2012m2 |
  3. | GM0014     01mar2012     2012m3 |
  4. | GM0014     01apr2012     2012m4 |
  5. | GM0014     01may2012     2012m5 |
     +---------------------------------+
✅ Task 5.3 tests passed

Task 5.4: Merge AEX Returns¶

📝 Task
Merge the monthly AEX returns using year_month. This is also a many-to-one merge (m:1) because:
  • Crime data has many observations per month (one per municipality)
  • Returns data has one observation per month (time series)
Merge using $processed/aex_monthly.dta. Keep only matched observations.
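A sketch of the merge, mirroring the pattern from Task 5.2:

```stata
* Many-to-one merge: one AEX return per month, many municipalities per month
merge m:1 year_month using "$processed/aex_monthly.dta"
tab _merge
keep if _merge == 3   // keep only municipality-months with a matching return
drop _merge
```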
    Result                      Number of obs
    -----------------------------------------
    Not matched                        19,895
        from master                    19,883  (_merge==1)
        from using                         12  (_merge==2)

    Matched                            36,396  (_merge==3)
    -----------------------------------------

   Matching result from |
                  merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
        Master only (1) |     19,883       35.32       35.32
         Using only (2) |         12        0.02       35.34
            Matched (3) |     36,396       64.66      100.00
------------------------+-----------------------------------
                  Total |     56,291      100.00
(19,895 observations deleted)
---- CHECKPOINT: AEX returns merged ----
Observations after merge: 36396

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
ret_compound |     36,396    .0072072    .0393381  -.1037125   .1351352
ret_std_mo~y |     36,396    .6468432    4.149869  -12.38631    8.55268
✅ Task 5.4 tests passed

Task 5.5: Verify Panel Structure and Save¶

📝 Task
Declare the panel structure using xtset and save the merged analysis panel to $processed/analysis_panel.dta. Use xtdescribe to verify the panel is balanced and check the time coverage.
Stata Tip: Panel Structure
  • encode municipality_code, gen(muni_id) — create numeric municipality ID
  • xtset muni_id year_month — declare panel structure
  • xtdescribe — verify panel balance
  • save "$processed/analysis_panel.dta", replace — save the merged panel
Panel variable: muni_id (strongly balanced)
 Time variable: year_month, 2012m1 to 2020m12
         Delta: 1 month

 muni_id:  1, 2, ..., 337                                    n =        337
year_month:  2012m1, 2012m2, ..., 2020m12                    T =        108
           Delta(year_month) = 1 month
           Span(year_month)  = 108 periods
           (muni_id*year_month uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                       108     108     108       108       108     108     108

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+--------------------------------------------------
> ------------------------------------------------------------
      337    100.00  100.00 |  111111111111111111111111111111111111111111111111
> 111111111111111111111111111111111111111111111111111111111111
 ---------------------------+--------------------------------------------------
> ------------------------------------------------------------
      337    100.00         |  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
file
    /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assi
    > gnments/06-assignment/data/processed/analysis_panel.dta saved
✅ Merged analysis panel saved to: /Users/casparm4/Github/rsm-data-analytics-in-
> finance-private/private/assignments/06-assignment/data/processed/analysis_pan
> el.dta
Municipalities: 337
Time periods: .
Total observations: 36396
✅ Task 5.5 tests passed

Section 6: Create Analysis Variables¶

Following Huck (2024), we test whether stock returns have differential effects on crime across income groups.

To do this, we need to:

  1. Create income terciles (high, medium, low income municipalities)
  2. Generate interaction terms (returns × income terciles)
  3. Create crime rate variables (per capita)
  4. Verify the variables are ready for regression

This section prepares all variables needed for the panel regressions in Part B.

Task 6.1: Load Analysis Panel¶

📝 Task
Load the merged analysis panel from $processed/analysis_panel.dta and verify the panel structure is intact.
Panel variable: muni_id (strongly balanced)
 Time variable: year_month, 2012m1 to 2020m12
         Delta: 1 month
✅ Task 6.1 tests passed

Task 6.2: Create Income Terciles¶

📝 Task
Create income tercile groups by splitting municipalities into three equal groups based on average median income:
  • Tercile 1: Low-income municipalities (bottom 33%)
  • Tercile 2: Medium-income municipalities (middle 33%)
  • Tercile 3: High-income municipalities (top 33%)
Use xtile to create the tercile variable.
Stata Tip
  • xtile income_tercile = avg_median_std_income, nq(3) — create terciles
  • tab income_tercile — verify distribution
  • bysort income_tercile: summarize avg_median_std_income — check income ranges
       Income |
      tercile |
      (1=low, |
       2=mid, |
      3=high) |      Freq.     Percent        Cum.
--------------+-----------------------------------
   Low income |     12,204       33.53       33.53
Medium income |     12,096       33.23       66.77
  High income |     12,096       33.23      100.00
--------------+-----------------------------------
        Total |     36,396      100.00

-------------------------------------------------------------------------------
-> income_tercile = Low income

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
avg_median~e |     12,204     21.7271    .7816501   19.83227   22.80175

-------------------------------------------------------------------------------
-> income_tercile = Medium income

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
avg_median~e |     12,096    23.34742    .2878872   22.82476   23.89005

-------------------------------------------------------------------------------
-> income_tercile = High income

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
avg_median~e |     12,096      25.448    1.659939   23.91478   36.45822

---- CHECKPOINT: Income terciles created ----
Tercile groups: 3
✅ Task 6.2 tests passed

Task 6.3: Create Tercile Dummy Variables¶

📝 Task
Create dummy (0/1) variables for each income tercile. These will be used to create interaction terms with returns:
  • low_income: 1 if tercile 1, 0 otherwise
  • mid_income: 1 if tercile 2, 0 otherwise
  • high_income: 1 if tercile 3, 0 otherwise
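One straightforward way to generate the dummies (a sketch; the variable labels match the dataset description in Task 6.5):

```stata
* 0/1 indicators from the tercile variable
gen low_income  = (income_tercile == 1)
gen mid_income  = (income_tercile == 2)
gen high_income = (income_tercile == 3)
label variable low_income  "Low income tercile (1=yes, 0=no)"
label variable mid_income  "Medium income tercile (1=yes, 0=no)"
label variable high_income "High income tercile (1=yes, 0=no)"
```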
✅ Task 6.3 tests passed

Task 6.4: Create Interaction Terms¶

📝 Task
Create interaction terms between standardized monthly returns and income tercile dummies. These interactions allow us to test whether the effect of returns on crime differs across income groups:
  • ret_x_low = ret_std_monthly × low_income
  • ret_x_mid = ret_std_monthly × mid_income
  • ret_x_high = ret_std_monthly × high_income
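The three products can be generated directly (a sketch):

```stata
* Interactions: standardized monthly return × income-group dummies
gen ret_x_low  = ret_std_monthly * low_income
gen ret_x_mid  = ret_std_monthly * mid_income
gen ret_x_high = ret_std_monthly * high_income
```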
ℹ️ Why Interactions?
The Huck (2024) hypothesis is that returns affect crime differently in high-income vs low-income areas. A simple regression of crime on returns would only tell us the average effect.

By including interaction terms, we estimate separate effects:

  • β₁: Effect of returns in low-income areas
  • β₂: Effect of returns in medium-income areas
  • β₃: Effect of returns in high-income areas

    We expect β₃ < 0 (returns ↓ crime in high-income areas) and β₁ > 0 (returns ↑ crime in low-income areas).
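As a preview of how these coefficients are estimated (a sketch only; the exact fixed effects and standard-error choices are made in Part B):

```stata
* Sketch: tercile-specific return effects with municipality fixed effects
* xtreg total_crimes ret_x_low ret_x_mid ret_x_high, fe vce(cluster muni_id)
```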

        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
       ret_x_low |     36,396     .216894    2.422353  -12.38631    8.55268
       ret_x_mid |     36,396    .2149746    2.411696  -12.38631    8.55268
      ret_x_high |     36,396    .2149746    2.411696  -12.38631    8.55268
    ---- CHECKPOINT: Interaction terms created ----
    Interactions ready for regression analysis
    
    ✅ Task 6.4 tests passed
    

    Task 6.5: Save Final Analysis Dataset¶

    📝 Task
    Verify all required variables are present and save the final analysis dataset save "$processed/analysis_panel.dta", replace. Use describe and summarize to document what's in the final dataset.
    Variable      Storage   Display    Value
        name         type    format    label      Variable label
    -------------------------------------------------------------------------------
    municipality~de str6    %-9s                  Municipality Code
    municipality~me str29   %-9s                  Municipality Name
    date            double  %td                   Date (Year-Month)
    year_month      float   %tm                   Year-month
    total_crimes    long    %12.0g                Total crimes reported
    assault_crimes  long    %12.0g                Assault crimes reported
    murder_crimes   long    %12.0g                Murder/manslaughter crimes
                                                    reported
    avg_median_st~e double  %10.0g                Average Median Standardized
                                                    Income
    income_tercile  byte    %13.0g     tercile_lbl
                                                  Income tercile (1=low, 2=mid,
                                                    3=high)
    low_income      float   %9.0g                 Low income tercile (1=yes, 0=no)
    mid_income      float   %9.0g                 Medium income tercile (1=yes,
                                                    0=no)
    high_income     float   %9.0g                 High income tercile (1=yes, 0=no)
    ret_compound    float   %9.0g                 Monthly compound return
    ret_std_monthly float   %9.0g                 Monthly standardized return (sum
                                                    of daily std ret)
    ret_x_low       float   %9.0g                 Standardized return × Low income
    ret_x_mid       float   %9.0g                 Standardized return × Medium
                                                    income
    ret_x_high      float   %9.0g                 Standardized return × High income
    
        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
    total_crimes |     36,396     223.713     568.559          0       9239
    assault_cr~s |     36,396    11.27756    28.59991          0        463
    murder_cri~s |     36,396    .7485163      2.4274          0         46
    avg_median~e |     36,396    23.50222    1.863055   19.83227   36.45822
    ret_std_mo~y |     36,396    .6468432    4.149869  -12.38631    8.55268
    -------------+---------------------------------------------------------
       ret_x_low |     36,396     .216894    2.422353  -12.38631    8.55268
       ret_x_mid |     36,396    .2149746    2.411696  -12.38631    8.55268
      ret_x_high |     36,396    .2149746    2.411696  -12.38631    8.55268
    file
        /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assi
        > gnments/06-assignment/data/processed/analysis_panel.dta saved
    Observations: 36396
    Variables: 20
    File saved: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/privat
    > e/assignments/06-assignment/data/processed/analysis_panel.dta
    
    ✅ Part A: Data Preparation COMPLETE
     Dataset ready for regression analysis in Part B
    
    ✅ Task 6.5 tests passed
    

    Part A Summary¶

    Congratulations! You have completed Part A: Data Preparation.

    What you accomplished:

    1. ✅ Section 1: Understood the research question from Huck (2024)
    2. ✅ Section 2: Loaded three raw datasets (crime, income, AEX)
    3. ✅ Section 3: Calculated monthly standardized returns following published methodology
    4. ✅ Section 4: Reshaped crime data from long to wide format
    5. ✅ Section 5: Merged datasets into municipality-month panel
    6. ✅ Section 6: Created analysis variables (terciles, interactions)

    Final dataset: $processed/analysis_panel.dta

    Panel structure:

    • Unit: Municipality (muni_id)
    • Time: Year-month (year_month)
    • Observations: 337 municipalities × 108 months (36,396 municipality-month observations)

    Key variables for Part B:

    • Outcome: total_crimes, assault_crimes, murder_crimes
    • Treatment: ret_std_monthly (standardized monthly return)
    • Moderator: income_tercile, low_income, mid_income, high_income
    • Interactions: ret_x_low, ret_x_mid, ret_x_high

    Next steps (Part B):

    • Descriptive statistics and visualizations
    • Panel regression with fixed effects
    • Testing differential effects across income groups
    • Robustness checks

    Your thesis replication will follow this exact workflow:

    1. Find published paper with interesting result
    2. Obtain raw data (harder than it looks!)
    3. Clean and merge datasets (60-80% of your time)
    4. Replicate baseline result
    5. Test extension or alternative mechanism
    6. Discuss limitations honestly

    Part B: Analysis¶

    Now that we have prepared the data, we can analyze the relationship between stock returns and crime across income groups.

    Section 7: Descriptive Statistics¶

    Before running regressions, we examine the data through summary statistics and understand the distribution of our key variables.

    Following standard empirical practice, we create:

    1. Summary statistics by income tercile (like Huck's Table 2)
    2. Return distribution statistics
    3. A formatted table for presentation

    Task 7.1: Summary Statistics by Income Tercile¶

    📝 Task
    Load the analysis panel and create summary statistics for crime variables broken down by income tercile. This shows whether crime levels differ systematically across low, medium, and high-income municipalities.
Stata Tip
    • use "$processed/analysis_panel.dta", clear — load the final panel
    • tabstat total_crimes assault_crimes murder_crimes, by(income_tercile) stat(n mean sd min max)
    • This shows N, mean, SD, min, max for each income group
    Panel variable: muni_id (strongly balanced)
     Time variable: year_month, 2012m1 to 2020m12
             Delta: 1 month
    
    Summary for variables: total_crimes assault_crimes murder_crimes
    Group variable: income_tercile (Income tercile (1=low, 2=mid, 3=high))
    
    income_tercile |         N      Mean        SD       Min       Max
    ---------------+--------------------------------------------------
        Low income |  12204.00    451.41    927.86     12.00   9239.00
                   |  12204.00     23.77     46.21      0.00    463.00
                   |  12204.00      1.63      3.93      0.00     46.00
    ---------------+--------------------------------------------------
     Medium income |  12096.00    131.25    132.65      0.00   1159.00
                   |  12096.00      6.37      7.19      0.00     58.00
                   |  12096.00      0.39      0.78      0.00      8.00
    ---------------+--------------------------------------------------
       High income |  12096.00     86.45     82.52      1.00    964.00
                   |  12096.00      3.58      3.83      0.00     39.00
                   |  12096.00      0.22      0.55      0.00      8.00
    ---------------+--------------------------------------------------
             Total |  36396.00    223.71    568.56      0.00   9239.00
                   |  36396.00     11.28     28.60      0.00    463.00
                   |  36396.00      0.75      2.43      0.00     46.00
    ------------------------------------------------------------------
    
    ✅ Task 7.1 tests passed
    

    Task 7.2: Summary Statistics for Returns¶

    📝 Task
    Examine the distribution of both return measures (compound and standardized). Use summarize, detail to see the full distribution including percentiles, skewness, and kurtosis. This helps identify outliers and understand the variation in market returns over the sample period.
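A sketch of the commands behind the output below:

```stata
display "Compound Monthly Returns:"
summarize ret_compound, detail

display "Standardized Monthly Returns:"
summarize ret_std_monthly, detail
```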
    Compound Monthly Returns:
    
                       Monthly compound return
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     -.101367      -.1037125
     5%    -.0590024      -.1037125
    10%    -.0470165      -.1037125       Obs              36,396
    25%    -.0185647      -.1037125       Sum of wgt.      36,396
    
    50%     .0115142                      Mean           .0072072
                            Largest       Std. dev.      .0393381
    75%     .0340431       .1351352
    90%     .0509977       .1351352       Variance       .0015475
    95%     .0623475       .1351352       Skewness      -.1718706
    99%     .0973073       .1351352       Kurtosis       3.780191
    
    Standardized Monthly Returns:
    
         Monthly standardized return (sum of daily std ret)
    -------------------------------------------------------------
          Percentiles      Smallest
     1%    -9.791107      -12.38631
     5%    -7.349111      -12.38631
    10%     -4.55779      -12.38631       Obs              36,396
    25%    -1.753273      -12.38631       Sum of wgt.      36,396
    
    50%     1.219513                      Mean           .6468432
                            Largest       Std. dev.      4.149869
    75%      3.87236        8.55268
    90%     5.401238        8.55268       Variance       17.22141
    95%     6.930266        8.55268       Skewness       -.625603
    99%     7.930677        8.55268       Kurtosis       3.264927
    
    ✅ Task 7.2 tests passed
    

    Task 7.3: Create Formatted Summary Table¶

    📝 Task
    Create a publication-quality summary statistics table and export it to LaTeX format. This table will show:
    • Crime variables by income tercile
    • Income and return variables for the full sample
    • Mean, SD, Min, Max, and N for each variable
    The table will be saved as $tables/summary_stats.tex for use in reports.
Stata Tip: Creating LaTeX Tables
    • estpost tabstat — prepare statistics for export
    • esttab using "filename.tex" — export to LaTeX
    • Use cells("mean(fmt(2)) sd(fmt(2)) min max count") to specify statistics
    • Add label option to use variable labels instead of names
    • Export as two separate files:
      • summary_stats_panelA.tex — crime statistics by income tercile (use by(income_tercile))
      • summary_stats_panelB.tex — full sample statistics
    (Note: the code below is run with echo so that the preserve/restore pair executes within a single block.)
    
    . di as txt ""
    
    
    . di as txt "Creating Panel A: Crime Statistics by Income Tercile..."
    Creating Panel A: Crime Statistics by Income Tercile...
    
    . preserve
    
    . estpost tabstat total_crimes assault_crimes murder_crimes, by(income_tercile)
    >  statistics(mean sd min max count) columns(statistics)
    
    Summary statistics: mean sd min max count
         for variables: total_crimes assault_crimes murder_crimes
      by categories of: income_tercile
    
    income_terci |   e(mean)      e(sd)     e(min)     e(max)   e(count) 
    -------------+-------------------------------------------------------
    Low income   |                                                       
    total_crimes |  451.4092   927.8566         12       9239      12204 
    assault_cr~s |   23.7718   46.20558          0        463      12204 
    murder_cri~s |  1.633972   3.933833          0         46      12204 
    -------------+-------------------------------------------------------
    Medium inc~e |                                                       
    total_crimes |  131.2493   132.6499          0       1159      12096 
    assault_cr~s |  6.371197   7.192719          0         58      12096 
    murder_cri~s |  .3865741   .7776529          0          8      12096 
    -------------+-------------------------------------------------------
    High income  |                                                       
    total_crimes |   86.4475   82.52387          1        964      12096 
    assault_cr~s |  3.578125   3.831013          0         39      12096 
    murder_cri~s |  .2170966   .5548765          0          8      12096 
    -------------+-------------------------------------------------------
    Total        |                                                       
    total_crimes |   223.713    568.559          0       9239      36396 
    assault_cr~s |  11.27756   28.59991          0        463      36396 
    murder_cri~s |  .7485163     2.4274          0         46      36396 
    
    . esttab using "$tables/summary_stats_panelA.tex", replace cells("mean(fmt(2)) 
    > sd(fmt(2)) min(fmt(0)) max(fmt(0)) count(fmt(0))") noobs nonumber nomtitle ti
    > tle("Panel A: Crime Statistics by Income Tercile")
    (output written to /Users/casparm4/Github/rsm-data-analytics-in-finance-private
    > /private/assignments/06-assignment/output/tables/summary_stats_panelA.tex)
    
    . restore
    
    . di as txt "Creating Panel B: Full Sample Statistics..."
    Creating Panel B: Full Sample Statistics...
    
    . estpost summarize total_crimes assault_crimes murder_crimes avg_median_std_in
    > come ret_compound ret_std_monthly
    
                 |  e(count)   e(sum_w)    e(mean)     e(Var)      e(sd) 
    -------------+-------------------------------------------------------
    total_crimes |     36396      36396    223.713   323259.3    568.559 
    assault_cr~s |     36396      36396   11.27756   817.9551   28.59991 
    murder_cri~s |     36396      36396   .7485163    5.89227     2.4274 
    avg_median~e |     36396      36396   23.50222   3.470975   1.863055 
    ret_compound |     36396      36396   .0072072   .0015475   .0393381 
    ret_std_mo~y |     36396      36396   .6468432   17.22141   4.149869 
    
                 |    e(min)     e(max)     e(sum) 
    -------------+---------------------------------
    total_crimes |         0       9239    8142258 
    assault_cr~s |         0        463     410458 
    murder_cri~s |         0         46      27243 
    avg_median~e |  19.83227   36.45822   855386.9 
    ret_compound | -.1037125   .1351352   262.3129 
    ret_std_mo~y | -12.38631    8.55268    23542.5 
    
    . esttab using "$tables/summary_stats_panelB.tex", replace cells("mean(fmt(2)) 
    > sd(fmt(2)) min(fmt(2)) max(fmt(2)) count(fmt(0))") noobs nonumber nomtitle ti
    > tle("Panel B: Full Sample Statistics") label
    (output written to /Users/casparm4/Github/rsm-data-analytics-in-finance-private
    > /private/assignments/06-assignment/output/tables/summary_stats_panelB.tex)
    
    . di as txt "---- CHECKPOINT: Summary tables exported ----"
    ---- CHECKPOINT: Summary tables exported ----
    
    . di as result "✅ Panel A saved: $tables/summary_stats_panelA.tex"
    ✅ Panel A saved: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/p
    > rivate/assignments/06-assignment/output/tables/summary_stats_panelA.tex
    
    . di as result "✅ Panel B saved: $tables/summary_stats_panelB.tex"
    ✅ Panel B saved: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/p
    > rivate/assignments/06-assignment/output/tables/summary_stats_panelB.tex
    
    . 
    
    ✅ Task 7.3 tests passed
    
    ℹ️ What to Look For in Descriptive Statistics

    When examining summary statistics, pay attention to:

    Crime levels across terciles:

    • Do high-income areas have more or less crime than low-income areas?
    • This affects interpretation—if high-income has zero crime, returns cannot reduce it further

    Return distribution:

    • Mean compound return tells us average market performance (should be slightly positive)
    • SD shows volatility—higher SD means more variation to identify effects
    • Daily standardized returns have mean ≈ 0 and SD ≈ 1 by construction; the monthly measure sums the daily values, so its SD is larger (with ~21 trading days, roughly √21 ≈ 4.6, close to the 4.15 observed)

    Sample balance:

    • Each tercile should have roughly N/3 observations
    • Large imbalances suggest issues with tercile creation

    Section 8: Visualize Crime and Returns Over Time¶

    Visual inspection is an essential first step before formal statistical tests. We create three plots:

    1. National crime trends — How has total crime evolved over time?
    2. Crime and returns overlay — Do they move together visually?
    3. Crime by income tercile — Do trends differ across income groups?

    These plots provide intuition before we estimate regressions.

    Task 8.1: National Crime Time Series¶

    📝 Task
    Create a time series plot showing total crime aggregated to the national level. Collapse the panel data by summing crime across all municipalities for each month, then plot the national total over time. This shows the overall crime trend in the Netherlands during the sample period. Save the figure as crime_timeseries.png.
Stata Tip: Creating Time Series Plots
    • preserve — save current dataset state
    • collapse (sum) total_crimes, by(year_month) — aggregate to national level
    • twoway line total_crimes year_month — create line plot
    • graph export "$figures/crime_timeseries.png", replace width(1200) — save figure
    • restore — return to original dataset
    (Note: the code below is run with echo so that the preserve/restore pair executes within a single block.)
    
    . preserve
    
    . collapse (sum) total_crimes assault_crimes murder_crimes, by(year_month)
    
    . twoway (line total_crimes year_month, lcolor(navy) lwidth(medium)), title("To
    > tal Crime in the Netherlands", size(medium)) subtitle("Monthly aggregates acr
    > oss all municipalities", size(small)) xtitle("Year-Month") ytitle("Total Crim
    > es Reported") xlabel(, angle(45) labsize(small)) ylabel(, labsize(small) form
    > at(%12.0fc)) graphregion(color(white)) plotregion(color(white)) note("Source:
    >  CBS/Politie crime data, 2012-2019", size(vsmall))
    
    . graph export "$figures/crime_timeseries.png", replace width(1200)
    file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assig
    > nments/06-assignment/output/figures/crime_timeseries.png written in PNG forma
    > t
    
    . di as txt "---- CHECKPOINT: National crime time series ----"
    ---- CHECKPOINT: National crime time series ----
    
    . di as result "✅ Figure saved: $figures/crime_timeseries.png"
    ✅ Figure saved: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/pr
    > ivate/assignments/06-assignment/output/figures/crime_timeseries.png
    
    . di as result "Months plotted: " _N
    Months plotted: 108
    
    . restore
    
    . 
    
    [Figure: Total Crime in the Netherlands, monthly aggregates across all municipalities]
    ✅ Task 8.1 tests passed
    

    Task 8.2: Overlay Crime and Returns¶

    📝 Task
    Create a dual-axis plot showing both crime and stock returns over time. To make them visually comparable, standardize both series (subtract mean, divide by SD). This allows visual inspection of co-movement: when returns are high, is crime higher or lower?

    For the standardization, create variables named crime_z and ret_z (z-scores for crime and returns respectively). Add a horizontal reference line at zero, and include a legend labelling the two series as "Total Crime (standardized)" and "AEX Returns (standardized)". Save the figure as crime_returns_overlay.png.

    ℹ️ Why Standardize for Visualization?
    Crime and returns have different units and scales:
    • Crime: counts ranging from ~10,000 to ~50,000 per month
    • Returns: percentages ranging from -10% to +10%

    By standardizing both (z-scores), we can plot them on the same axis and see whether they move together. A standardized value of +1 means "one standard deviation above average" for that series.

    If crime and returns move in opposite directions (negative correlation), we should see peaks in one series align with troughs in the other.
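    The standardization described above can be sketched outside Stata; this Python snippet mirrors the egen mean/sd and gen z-score steps from the tip below (the series values are hypothetical, and Stata's egen sd uses the sample SD, i.e. ddof=1):

    ```python
    import numpy as np

    def zscore(x):
        """Standardize a series: subtract the mean, divide by the sample SD."""
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / x.std(ddof=1)

    # Illustrative monthly series on very different scales (hypothetical numbers)
    crime   = np.array([41000, 43500, 39800, 45200, 44100, 40300])  # counts
    returns = np.array([0.021, -0.034, 0.008, -0.051, 0.017, 0.029])  # decimal returns

    crime_z, ret_z = zscore(crime), zscore(returns)
    # Both series now have mean ~0 and SD 1, so they can share one axis
    ```

    After this transformation a value of +1 on either line reads as "one standard deviation above that series' own average", which is what makes the overlay interpretable.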

    Stata Tip
    • preserve — save current dataset state
    • collapse (sum) total_crimes (mean) ret_std_monthly, by(year_month) — aggregate to national level
    • egen mean_crime = mean(total_crimes) — calculate mean for z-score
    • egen sd_crime = sd(total_crimes) — calculate SD for z-score
    • gen crime_z = (total_crimes - mean_crime) / sd_crime — standardize
    • Repeat for returns: create mean_ret, sd_ret, and ret_z
    • twoway (line crime_z year_month) (line ret_z year_month) — overlay both series
    • graph export "$figures/crime_returns_overlay.png", replace width(1200)
    • restore — return to original dataset
    (Note: the code below is run with echo to enable preserve/restore across the cell.)
    
    . preserve
    
    . collapse (sum) total_crimes (mean) ret_std_monthly, by(year_month)
    
    . egen mean_crime = mean(total_crimes)
    
    . egen sd_crime = sd(total_crimes)
    
    . gen crime_z = (total_crimes - mean_crime) / sd_crime
    
    . egen mean_ret = mean(ret_std_monthly)
    
    . egen sd_ret = sd(ret_std_monthly)
    
    . gen ret_z = (ret_std_monthly - mean_ret) / sd_ret
    
    . twoway (line crime_z year_month, lcolor(maroon) lwidth(medium) lpattern(solid
    > )) (line ret_z year_month, lcolor(navy) lwidth(medium) lpattern(dash)), title
    > ("Crime and Stock Returns Over Time", size(medium)) subtitle("Standardized (z
    > -scores) for visual comparison", size(small)) xtitle("Year-Month") ytitle("St
    > andard Deviations from Mean") xlabel(, angle(45) labsize(small)) ylabel(, lab
    > size(small)) legend(label(1 "Total Crime (standardized)") label(2 "AEX Return
    > s (standardized)") rows(1) position(6) size(small)) graphregion(color(white))
    >  plotregion(color(white)) yline(0, lcolor(gray) lpattern(dot)) note("Both ser
    > ies standardized to mean=0, SD=1 for comparability", size(vsmall))
    
    . graph export "$figures/crime_returns_overlay.png", replace width(1200)
    file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assig
    > nments/06-assignment/output/figures/crime_returns_overlay.png written in PNG 
    > format
    
    . di as txt "---- CHECKPOINT: Crime and returns overlay ----"
    ---- CHECKPOINT: Crime and returns overlay ----
    
    . di as result "✅ Figure saved: $figures/crime_returns_overlay.png"
    ✅ Figure saved: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/pr
    > ivate/assignments/06-assignment/output/figures/crime_returns_overlay.png
    
    . restore
    
    . 
    
    ✅ Task 8.2 tests passed
    

    Task 8.3: Crime Time Series by Income Tercile¶

    📝 Task
    Create a plot showing crime trends separately for low, medium, and high-income municipalities. This visualizes the key hypothesis: if Huck's theory holds, crime in high-income areas should move differently than crime in low-income areas when market returns change.

    First collapse the data to the tercile-month level (sum of total_crimes by year_month and income_tercile). Include a legend with labels "Low Income (Tercile 1)", "Medium Income (Tercile 2)", and "High Income (Tercile 3)". Save the figure as crime_by_income.png.

    Stata Tip
    • preserve
    • collapse (sum) total_crimes, by(year_month income_tercile)
    • Use twoway with three line plots, each filtered by if income_tercile==1, ==2, ==3
    • graph export "$figures/crime_by_income.png", replace width(1200)
    • restore
    (Note: the code below is run with echo to enable preserve/restore across the cell.)
    
    . preserve
    
    . collapse (sum) total_crimes, by(year_month income_tercile)
    
    . twoway (line total_crimes year_month if income_tercile==1, lcolor(red) lwidth
    > (medium) lpattern(solid)) (line total_crimes year_month if income_tercile==2,
    >  lcolor(orange) lwidth(medium) lpattern(dash)) (line total_crimes year_month 
    > if income_tercile==3, lcolor(green) lwidth(medium) lpattern(longdash)), title
    > ("Crime Trends by Income Tercile", size(medium)) subtitle("Total crime aggreg
    > ated by municipality income group", size(small)) xtitle("Year-Month") ytitle(
    > "Total Crimes Reported") xlabel(, angle(45) labsize(small)) ylabel(, labsize(
    > small) format(%12.0fc)) legend(label(1 "Low Income (Tercile 1)") label(2 "Med
    > ium Income (Tercile 2)") label(3 "High Income (Tercile 3)") rows(1) position(
    > 6) size(small)) graphregion(color(white)) plotregion(color(white)) note("Sour
    > ce: CBS/Politie crime data by municipality income tercile", size(vsmall))
    
    . graph export "$figures/crime_by_income.png", replace width(1200)
    file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assig
    > nments/06-assignment/output/figures/crime_by_income.png written in PNG format
    
    . di as txt "---- CHECKPOINT: Crime by income tercile ----"
    ---- CHECKPOINT: Crime by income tercile ----
    
    . di as result "✅ Figure saved: $figures/crime_by_income.png"
    ✅ Figure saved: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/pr
    > ivate/assignments/06-assignment/output/figures/crime_by_income.png
    
    . restore
    
    . 
    
    ✅ Task 8.3 tests passed
    
    🚀 Interpreting the Visualizations
    After creating these three plots, examine them carefully:

    Figure 1 (National trend): Is crime increasing, decreasing, or stable? Are there seasonal patterns?

    Figure 2 (Overlay): Do crime and returns move together (positive correlation) or in opposite directions (negative correlation)? Remember: we expect crime to decrease when returns are high (negative relationship).

    Figure 3 (By tercile): Do the three income groups have similar levels and trends? High-income areas typically have lower crime levels. If high-income municipalities have near-zero crime, it may be difficult to detect a relationship with returns (floor effect).

    Important: Visual patterns are suggestive but not conclusive. We need regression analysis (next sections) to control for confounders and test statistical significance.


    Section 9: Baseline Panel Regression¶

    We now estimate our first formal statistical tests. The baseline regressions estimate the average effect of stock returns on crime across all municipalities, without distinguishing by income group.

    Specification:

    $$\text{Crime}_{it} = \beta_0 + \beta_1 \text{Returns}_t + \alpha_i + \varepsilon_{it}$$

    Where:

    • $\text{Crime}_{it}$ = Total crimes in municipality $i$ at month $t$
    • $\text{Returns}_t$ = Monthly AEX return (compound or standardized)
    • $\alpha_i$ = Municipality fixed effects (absorbs time-invariant municipality characteristics)
    • $\varepsilon_{it}$ = Error term (clustered at municipality level)

    Key question: Is $\beta_1$ positive or negative? Does higher return correlate with more or less crime on average?
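    The within (fixed-effects) estimator behind this specification can be illustrated outside Stata: demean crime and returns within each municipality (which removes $\alpha_i$), then run OLS on the demeaned data. A minimal Python sketch with simulated data — all names and numbers are made up, with a true return effect of 0.5:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_muni, n_months = 5, 24
    muni  = np.repeat(np.arange(n_muni), n_months)            # municipality id per row
    ret   = np.tile(rng.normal(0, 2, n_months), n_muni)       # same return for every municipality
    alpha = np.repeat(rng.normal(200, 50, n_muni), n_months)  # municipality fixed effects
    crime = alpha + 0.5 * ret + rng.normal(0, 1, muni.size)   # true beta1 = 0.5

    def demean_by(x, groups):
        """Within transformation: subtract each group's mean."""
        out = x.astype(float).copy()
        for g in np.unique(groups):
            out[groups == g] -= x[groups == g].mean()
        return out

    y, x = demean_by(crime, muni), demean_by(ret, muni)
    beta1 = (x * y).sum() / (x * x).sum()   # within estimate of the return effect
    ```

    Demeaning wipes out the large level differences in alpha, so beta1 recovers the true 0.5 despite fixed effects that dwarf the return effect — exactly why rho (the fraction of variance due to u_i) can be near 1 in the xtreg output while β₁ is still well identified.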

    Task 9.1: Load Analysis Panel and Verify Structure¶

    📝 Task
    Load the analysis panel from $processed/analysis_panel.dta and verify the panel structure is correctly set. Check that we have the right number of municipalities and time periods before running regressions.
    Panel variable: muni_id (strongly balanced)
     Time variable: year_month, 2012m1 to 2020m12
             Delta: 1 month
    
     muni_id:  1, 2, ..., 337                                    n =        337
    year_month:  2012m1, 2012m2, ..., 2020m12                    T =        108
               Delta(year_month) = 1 month
               Span(year_month)  = 108 periods
               (muni_id*year_month uniquely identifies each observation)
    
    Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                           108     108     108       108       108     108     108
    
         Freq.  Percent    Cum. |  Pattern
     ---------------------------+--------------------------------------------------
    > ------------------------------------------------------------
          337    100.00  100.00 |  111111111111111111111111111111111111111111111111
    > 111111111111111111111111111111111111111111111111111111111111
     ---------------------------+--------------------------------------------------
    > ------------------------------------------------------------
          337    100.00         |  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    > XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    ---- CHECKPOINT: Panel structure verified ----
    Panel variable: muni_id
    Time variable: year_month
    Total observations: 36396
    
    Variable      Storage   Display    Value
        name         type    format    label      Variable label
    -------------------------------------------------------------------------------
    total_crimes    long    %12.0g                Total crimes reported
    ret_compound    float   %9.0g                 Monthly compound return
    ret_std_monthly float   %9.0g                 Monthly standardized return (sum
                                                    of daily std ret)
    income_tercile  byte    %13.0g     tercile_lbl
                                                  Income tercile (1=low, 2=mid,
                                                    3=high)
    
    ✅ Task 9.1 tests passed
    

    Task 9.2: Estimate Baseline with Compound Returns¶

    📝 Task
    Estimate the baseline panel regression using compound monthly returns as the independent variable. Use municipality fixed effects and cluster standard errors at the municipality level. Interpret the coefficient: what does it mean economically?
    Stata Tip: Panel Regression
    • xtreg depvar indepvar, fe — fixed effects regression
    • vce(cluster muni_id) — cluster standard errors by municipality
    • estimates store modelname — save estimates for later comparison
    • Coefficient interpretation: A 1 percentage point increase in returns → change of β₁ crimes
    ========================================
    Baseline Model: Compound Returns
    ========================================
    
    
    Fixed-effects (within) regression               Number of obs     =     36,396
    Group variable: muni_id                         Number of groups  =        337
    
    R-squared:                                      Obs per group:
         Within  = 0.0002                                         min =        108
         Between =      .                                         avg =      108.0
         Overall = 0.0000                                         max =        108
    
                                                    F(1, 336)         =      36.31
    corr(u_i, Xb) = 0.0000                          Prob > F          =     0.0000
    
                                  (Std. err. adjusted for 337 clusters in muni_id)
    ------------------------------------------------------------------------------
                 |               Robust
    total_crimes | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
    ret_compound |   33.71929   5.595892     6.03   0.000     22.71189    44.72668
           _cons |     223.47   .0403306  5540.95   0.000     223.3906    223.5493
    -------------+----------------------------------------------------------------
         sigma_u |  562.89334
         sigma_e |  86.074038
             rho |  .97715169   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    
    ---- CHECKPOINT: Baseline regression (compound) ----
    Coefficient on ret_compound: 33.719288
    Standard error: 5.5958915
    t-statistic: 6.0257223
    
    Interpretation:
     → Positive coefficient: Higher returns associated with MORE crime
    
    ✅ Task 9.2 tests passed
    

    Task 9.3: Estimate Baseline with Standardized Returns¶

    📝 Task
    Estimate the same baseline specification using standardized returns (ret_std_monthly) instead of compound returns. Use the same fixed effects and clustered standard errors as before: xtreg total_crimes ret_std_monthly, fe vce(cluster muni_id). This follows Huck's (2024) methodology more closely. Compare the results: do both return measures give similar findings about the average relationship? Store the estimates as m1_std.
    ========================================
    Baseline Model: Standardized Returns
    ========================================
    
    
    Fixed-effects (within) regression               Number of obs     =     36,396
    Group variable: muni_id                         Number of groups  =        337
    
    R-squared:                                      Obs per group:
         Within  = 0.0010                                         min =        108
         Between =      .                                         avg =      108.0
         Overall = 0.0000                                         max =        108
    
                                                    F(1, 336)         =      43.37
    corr(u_i, Xb) = -0.0000                         Prob > F          =     0.0000
    
                                  (Std. err. adjusted for 337 clusters in muni_id)
    ------------------------------------------------------------------------------
                 |               Robust
    total_crimes | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
    ret_std_mo~y |   .6651736   .1009989     6.59   0.000     .4665037    .8638435
           _cons |   223.2827   .0653305  3417.74   0.000     223.1542    223.4112
    -------------+----------------------------------------------------------------
         sigma_u |  562.89334
         sigma_e |  86.039671
             rho |  .97716951   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    
    ---- CHECKPOINT: Baseline regression (standardized) ----
    Coefficient on ret_std_monthly: .6651736
    Standard error: .10099893
    t-statistic: 6.5859472
    
    Interpretation:
     → A 1 SD increase in returns → change of .6651736 crimes
    
    ✅ Task 9.3 tests passed
    
    ℹ️ Comparing the Two Baseline Models
    You have now estimated the same model with two different return measures:

    Model 1 (Compound Returns):

    • Coefficient is in "crimes per 1 percentage point return"
    • Example: β = -100 means a 1% return reduces crime by 100 incidents
    • Easy to interpret economically (percentage points are intuitive)

      Model 2 (Standardized Returns):

      • Coefficient is in "crimes per 1 standard deviation return"
      • Example: β = -50 means a 1 SD return surprise reduces crime by 50 incidents
      • Captures the "surprise" interpretation (Huck's methodology)
      • Controls for volatility regime changes

        What to expect: Both should show the same sign (positive or negative) but different magnitudes. The standardized version is preferred for comparability to Huck (2024).

        Important limitation: These baseline models estimate the average effect across all income groups. They do not test Huck's key hypothesis about differential effects. That comes in Section 10.
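        A sketch of how the two monthly return measures could be built from daily AEX returns, based on the variable labels in the panel ("Monthly compound return" and "sum of daily std ret"). The daily numbers and the trailing standardization window are assumptions for illustration, not the assignment's actual construction:

        ```python
        import numpy as np

        daily_ret = np.array([0.4, -1.1, 0.7, 0.2, -0.3])  # one month of daily returns, in % (hypothetical)

        # Compound monthly return: chain-link the daily returns
        ret_compound = (np.prod(1 + daily_ret / 100) - 1) * 100

        # Standardized monthly return: z-score each daily return against a trailing
        # sample (the window choice here is an assumption), then sum the surprises
        hist = np.array([0.1, -0.2, 0.3, 0.5, -0.4, 0.0, 0.2, -0.1])  # trailing daily returns (hypothetical)
        mu, sigma = hist.mean(), hist.std(ddof=1)
        z_daily = (daily_ret - mu) / sigma
        ret_std_monthly = z_daily.sum()
        ```

        Standardizing against recent volatility is what gives the "surprise" reading: the same −1.1% day counts for more in a calm regime (small sigma) than in a turbulent one.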


        Section 10: Differential Effects by Income (Core Test)¶

        This is the key test of Huck's hypothesis: Do stock returns affect crime differently in high-income vs. low-income areas?

        We use standardized returns for the interaction analysis, following Huck (2024). The compound return baseline from Section 9 serves as a naive comparison in the final regression table.

        Huck's (2024) finding:

        • High-income areas (investors): Returns ↑ → Crime ↓ (negative coefficient)
        • Low-income areas (non-investors): Returns ↑ → Crime ↑ (positive coefficient)

        Mechanism: Relative wealth/status affects psychological well-being. When the market rises:

        • Investors feel wealthier → better mood → less crime
        • Non-investors feel relatively poorer → worse mood → more crime

        Our specification (with interactions):

        $$\text{Crime}_{it} = \beta_0 + \beta_1 \text{Returns}_t + \beta_2 (\text{Returns}_t \times \text{High}_{i}) + \beta_3 (\text{Returns}_t \times \text{Low}_{i}) + \alpha_i + \varepsilon_{it}$$

        Where:

        • $\beta_1$ = Effect in medium-income areas (reference group)
        • $\beta_1 + \beta_2$ = Effect in high-income areas
        • $\beta_1 + \beta_3$ = Effect in low-income areas

        Key test: Is $\beta_2 < 0$ (high-income negative) and $\beta_3 > 0$ (low-income positive)?
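        The group-specific effects implied by this specification are simple linear combinations of the coefficients. A quick sketch with purely illustrative coefficient values (hypothetical, not estimates):

        ```python
        # Hypothetical estimates from the interaction specification (mid = reference group)
        beta1 = 0.44    # effect of a 1 SD return in mid-income municipalities
        beta2 = -0.20   # additional effect in high-income municipalities
        beta3 = 0.87    # additional effect in low-income municipalities

        effect_mid  = beta1
        effect_high = beta1 + beta2   # Huck's hypothesis: negative overall
        effect_low  = beta1 + beta3   # Huck's hypothesis: positive and larger

        # The key differential equals beta3 - beta2; testing it against zero is
        # the formal version of "is the high-low gap significant?"
        differential = effect_low - effect_high
        ```

        In Stata, the same linear combinations would come from lincom or test after estimation rather than by hand.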

        Task 10.1: Estimate Interactions with Standardized Returns¶

        📝 Task
        Estimate the interaction model using standardized returns. The interaction terms ret_x_high, ret_x_mid, and ret_x_low were already created in Part A (Section 6). Use municipality fixed effects with clustered standard errors: xtreg total_crimes ret_x_low ret_x_mid ret_x_high, fe vce(cluster muni_id). Note that this saturated parameterization omits the main return term, so each coefficient is the total return effect for that tercile rather than a difference from a reference group. Store the estimates as m2_std. This specification follows Huck's (2024) methodology.
        Variable      Storage   Display    Value
            name         type    format    label      Variable label
        -------------------------------------------------------------------------------
        ret_x_high      float   %9.0g                 Standardized return × High income
        ret_x_mid       float   %9.0g                 Standardized return × Medium
                                                        income
        ret_x_low       float   %9.0g                 Standardized return × Low income
        
        ========================================
        Interaction Model: Standardized Returns
        ========================================
        
        
        Fixed-effects (within) regression               Number of obs     =     36,396
        Group variable: muni_id                         Number of groups  =        337
        
        R-squared:                                      Obs per group:
             Within  = 0.0015                                         min =        108
             Between = 0.0835                                         avg =      108.0
             Overall = 0.0010                                         max =        108
        
                                                        F(3, 336)         =      55.37
        corr(u_i, Xb) = 0.0256                          Prob > F          =     0.0000
        
                                      (Std. err. adjusted for 337 clusters in muni_id)
        ------------------------------------------------------------------------------
                     |               Robust
        total_crimes | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
        -------------+----------------------------------------------------------------
           ret_x_low |    1.30699   .2843909     4.60   0.000     .7475796    1.866401
           ret_x_mid |   .4399819   .0600953     7.32   0.000     .3217714    .5581924
          ret_x_high |   .2428179   .0254009     9.56   0.000     .1928532    .2927826
               _cons |   223.2827   .0632572  3529.76   0.000     223.1583    223.4072
        -------------+----------------------------------------------------------------
             sigma_u |  562.80676
             sigma_e |   86.02041
                 rho |  .97717264   (fraction of variance due to u_i)
        ------------------------------------------------------------------------------
        
        ---- CHECKPOINT: Interaction model (standardized) ----
        
        Effect by income group (per 1 SD return):
         Low-income: 1.3069905
         Mid-income: .43998192
         High-income: .24281791
        
        ✗ Both positive: Returns increase crime in all groups
        
        ✅ Task 10.1 tests passed
        

        Task 10.2: Interpret the Differential Effects¶

        📝 Task
        Compare the estimated coefficients across income groups and answer the following questions in the markdown cell below:
        1. Are the coefficients for high-income and low-income areas statistically significant?
        2. What are the signs of the coefficients? Do they match Huck's findings?
        3. What is the economic magnitude? (How many crimes per SD of return?)
        4. Why might the pattern differ from Huck's US findings?
        ℹ️ Interpreting Interaction Coefficients
        What Huck (2024) found in the US:
        Using daily data from 2,700 police agencies (1991-2015):
        • High-income areas (investors): A 1 SD return increase → 37 bps decrease in violent crime (−12.5 crimes per 100M population)
        • Low-income areas (non-investors): A 1 SD return increase → 25 bps increase in violent crime (+15.2 crimes per 100M population)
        • The difference between high and low income coefficients is statistically significant (t = 3.31)

        What you should look for:

        • Sign: Is high-income negative and low-income positive?
        • Significance: Are t-statistics > 2 (roughly 5% significance)?
        • Magnitude: Are effects economically meaningful?
        • Difference: Is the high-low gap statistically significant?

          Why results might differ:

          • Monthly vs daily data (timing mechanism lost)
          • Different country context (NL vs US investor participation rates)
          • Different time period (2012-2019 vs 1991-2015)
          • Income terciles are imperfect proxy for investor status
          • Different crime types/reporting systems


            Section 11: Two-Way Fixed Effects¶

            So far we have used one-way fixed effects (municipality FE only). This controls for time-invariant municipality characteristics (e.g., urban vs rural, demographics) but does not control for time shocks that affect all municipalities simultaneously.

            Two-way fixed effects (TWFE) adds time fixed effects to absorb common shocks:

            $$\text{Crime}_{it} = \beta_2 (\text{Returns}_t \times \text{High}_{i}) + \beta_3 (\text{Returns}_t \times \text{Low}_{i}) + \alpha_i + \gamma_t + \varepsilon_{it}$$

            Where:

            • $\alpha_i$ = Municipality fixed effects
            • $\gamma_t$ = Time (year-month) fixed effects
            • Note: The main return effect $\beta_1 \text{Returns}_t$ is absorbed by time FE because all municipalities have the same return in month $t$

            What time FE controls for:

            • Seasonal crime patterns (more crime in summer?)
            • National policy changes (new policing strategies)
            • Economic conditions (unemployment, GDP growth)
            • Any factor that affects all municipalities equally in a given month

            Key advantage: Better identification of differential effects across income groups by removing confounding time trends.
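            The absorption of the main return effect can be verified mechanically: a regressor that is constant within each month becomes identically zero after demeaning by month, which is the time-FE part of the two-way within transformation. A small Python sketch (all numbers made up):

            ```python
            import numpy as np

            n_muni, n_months = 4, 6
            month = np.tile(np.arange(n_months), n_muni)
            # Same market return for every municipality in a given month
            ret = np.tile(np.array([1.2, -0.5, 0.3, 2.0, -1.1, 0.7]), n_muni)

            # Demean by month: subtract each month's cross-municipality mean
            ret_demeaned = ret.copy()
            for t in range(n_months):
                ret_demeaned[month == t] -= ret[month == t].mean()

            # Every entry is exactly zero: under time FE the main return effect
            # cannot be estimated. Interactions like ret * high_income survive
            # because they still vary across municipalities within a month.
            ```

            This is why the reghdfe call below includes only the interaction terms.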

            Task 11.1: Estimate TWFE with reghdfe¶

            📝 Task
            Estimate the interaction model with two-way fixed effects using reghdfe. This package efficiently handles high-dimensional fixed effects (many municipalities × many months). Include only the interaction terms, as the main return effect is absorbed by time FE.
            Stata Tip: reghdfe
            • reghdfe depvar indepvars, absorb(fe1 fe2) vce(cluster var)
            • absorb(muni_id year_month) — absorbs both municipality and time FE
            • Do NOT include main return effect (it's collinear with time FE)
            • Only include interactions that vary within time periods (income × returns)
            ========================================
            Two-Way Fixed Effects Model
            ========================================
            
            Note: Main return effect absorbed by time FE
             (all municipalities have same return each month)
            
            (MWFE estimator converged in 2 iterations)
            
            HDFE Linear regression                            Number of obs   =     36,396
            Absorbing 2 HDFE groups                           F(   2,    336) =      11.10
            Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                              R-squared       =     0.9805
                                                              Adj R-squared   =     0.9802
                                                              Within R-sq.    =     0.0006
            Number of clusters (muni_id) =        337         Root MSE        =    79.9471
            
                                          (Std. err. adjusted for 337 clusters in muni_id)
            ------------------------------------------------------------------------------
                         |               Robust
            total_crimes | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            -------------+----------------------------------------------------------------
              ret_x_high |   -.197164   .0653383    -3.02   0.003    -.3256876   -.0686404
               ret_x_low |   .8670085   .2910952     2.98   0.003     .2944098    1.439607
                   _cons |   223.5673   .0672407  3324.88   0.000     223.4351    223.6996
            ------------------------------------------------------------------------------
            
            Absorbed degrees of freedom:
            -----------------------------------------------------+
             Absorbed FE | Categories  - Redundant  = Num. Coefs |
            -------------+---------------------------------------|
                 muni_id |       337         337           0    *|
              year_month |       108           1         107     |
            -----------------------------------------------------+
            * = FE nested within cluster; treated as redundant for DoF computation
            
            ---- CHECKPOINT: Two-way FE model ----
            
            Effect by income group (relative to mid-income):
             High-income: -.19716402
             Low-income: .86700855
            
            Fixed effects absorbed:
             Municipalities: .
             Time periods: .
            
            ✅ Task 11.1 tests passed
            

            Task 11.2: Compare One-Way vs Two-Way Fixed Effects¶

            📝 Task
            Display the results side-by-side to compare how coefficients change when adding time fixed effects. Use esttab to create a clean comparison table showing both specifications.
            ========================================
            Comparison: One-Way FE vs Two-Way FE
            ========================================
            
            
            Effect of Standardized Returns on Crime by Income Group
            ----------------------------------------------------
                                          (1)             (2)   
                                   One-Way FE      Two-Way FE   
            ----------------------------------------------------
            Monthly standardiz~m                                
                                                                
            
            Standardized retur~e        1.307***        0.867***
                                      (0.284)         (0.291)   
            
            Standardized retur~c        0.440***                
                                      (0.060)                   
            
            Standardized retur~m        0.243***       -0.197***
                                      (0.025)         (0.065)   
            ----------------------------------------------------
            Observations               36,396          36,396   
            Adjusted R²                 0.001           0.980   
            ----------------------------------------------------
            Standard errors in parentheses
            * p<0.10, ** p<0.05, *** p<0.01
            
            ---- CHECKPOINT: Model comparison ----
            
            Key differences:
             • One-Way FE: Controls for municipality characteristics only
             • Two-Way FE: Controls for municipality + time shocks
            
            Questions to consider:
             1. Do coefficients change substantially?
             2. Do significance levels change?
             3. Which specification is more credible?
            
            ✅ Task 11.2 tests passed
            
            ℹ️ Why Two-Way Fixed Effects?
            What changes with TWFE:
            • Identification: Effects now identified purely from differential responses across income groups (within-month variation)
            • Confounders removed: Any national trend (seasonality, policy changes, economic conditions) is absorbed
            • Main effect absorbed: Cannot estimate average return effect (all municipalities have same return each month)

            How to interpret coefficient changes:

            • If coefficients shrink: one-way FE suffered omitted variable bias from time trends
            • If coefficients grow: time FE removed noise, revealing stronger differential effects
            • If signs flip: confounding was severe (unusual but possible)
            • If nothing changes: time trends were uncorrelated with treatment (rare)

              Which is more credible?
              Two-way FE is generally preferred because:

              • Controls for more confounders
              • Standard in modern applied work (Angrist & Pischke, Cunningham)
              • Necessary when there are common time shocks (which there always are)

                However, TWFE has limitations:

                • Fewer degrees of freedom (absorbing ~108 time periods)
                • Cannot estimate main treatment effect
                • If treatment varies over time AND across units, identification comes solely from differential time trends


                  Section 12: Robustness Checks¶

                  Good empirical work tests whether findings are robust to alternative specifications. If results change dramatically across reasonable alternatives, they may be fragile or driven by specific modeling choices.

                  We test three alternative specifications:

                  1. Log specification — Test whether effects are proportional (% changes) rather than absolute
                  2. Assault crimes only — Focus on violent crime (closer to Huck's dependent variable)
                  3. Volatility instead of returns — Test whether market uncertainty (not direction) affects crime

                  What makes a good robustness check?

                  • Addresses a legitimate alternative hypothesis
                  • Changes one thing at a time (outcome, specification, or sample)
                  • Has a clear interpretation if results differ

                  All robustness checks use the TWFE specification (municipality + time FE) from Section 11.

                  Task 12.1: Log Specification¶

                  📝 Task
                  Re-estimate the TWFE model using log-transformed crime as the dependent variable. This tests whether returns affect crime in percentage terms rather than in absolute counts. Create the log variable log_total_crimes if it doesn't exist (use `ln(total_crimes + 1)` to handle zeros), then estimate the TWFE model with reghdfe using absorb(muni_id year_month) and vce(cluster muni_id). Store the estimates as m4_log.
                  ℹ️ Why Log Specification?
                  Levels vs Logs:
                  • Levels: β measures change in crime counts (e.g., 10 more crimes)
                  • Logs: β measures change in percentage (e.g., 2% increase)

                  When logs are preferred:

                  • Effects are proportional to baseline (bigger cities have bigger absolute changes)
                  • Outcome is skewed (few very high-crime municipalities)
                  • Interpretation in % is more natural

                    Technical note: We use ln(crime + 1) to handle zeros (some municipalities have zero assault/murder in some months).
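A minimal sketch of the commands for this task (assuming the interaction variables `ret_x_high` and `ret_x_low` created earlier in the assignment; `eststo` requires the estout package):

```stata
* Robustness check 1: log specification (ln(x + 1) handles zero-crime months)
capture confirm variable log_total_crimes
if _rc gen log_total_crimes = ln(total_crimes + 1)

* TWFE: municipality + year-month fixed effects, clustered by municipality
reghdfe log_total_crimes ret_x_high ret_x_low, ///
    absorb(muni_id year_month) vce(cluster muni_id)
eststo m4_log
```

The `capture confirm variable` guard makes the do-file re-runnable: the variable is only generated if it does not already exist.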

                    ========================================
                    Robustness Check 1: Log Specification
                    ========================================
                    
                    (MWFE estimator converged in 2 iterations)
                    
                    HDFE Linear regression                            Number of obs   =     36,396
                    Absorbing 2 HDFE groups                           F(   2,    336) =       0.54
                    Statistics robust to heteroskedasticity           Prob > F        =     0.5830
                                                                      R-squared       =     0.9676
                                                                      Adj R-squared   =     0.9672
                                                                      Within R-sq.    =     0.0000
                    Number of clusters (muni_id) =        337         Root MSE        =     0.1909
                    
                                                  (Std. err. adjusted for 337 clusters in muni_id)
                    ------------------------------------------------------------------------------
                                 |               Robust
                    log_total_~s | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                    -------------+----------------------------------------------------------------
                      ret_x_high |   .0001186   .0005546     0.21   0.831    -.0009723    .0012095
                       ret_x_low |  -.0003916   .0004915    -0.80   0.426    -.0013584    .0005752
                           _cons |   4.680233   .0001938  2.4e+04   0.000     4.679852    4.680615
                    ------------------------------------------------------------------------------
                    
                    Absorbed degrees of freedom:
                    -----------------------------------------------------+
                     Absorbed FE | Categories  - Redundant  = Num. Coefs |
                    -------------+---------------------------------------|
                         muni_id |       337         337           0    *|
                      year_month |       108           1         107     |
                    -----------------------------------------------------+
                    * = FE nested within cluster; treated as redundant for DoF computation
                    
                    ---- CHECKPOINT: Log specification ----
                    
                    Interpretation:
                     Coefficients now measure % changes in crime
                     High-income: 0.0001 (1 SD return →   0.01% change)
                     Low-income: -0.0004 (1 SD return →  -0.04% change)
                    
                    ✅ Task 12.1 tests passed
                    
                    ⚠️ Why Do the Results Disappear in Logs?
                    You should find that the interaction coefficients become tiny and statistically insignificant in the log specification. This is an important finding, not a problem.

                    The level specification asks: "Do returns cause a different number of crimes across income groups?" — and finds significant effects.
                    The log specification asks: "Do returns cause a different percentage change in crime across income groups?" — and finds nothing.

                    This happens because large municipalities drive the level results. In levels, a municipality with 5,000 monthly crimes contributes far more variance than one with 20 crimes. The log transformation compresses the scale and equalizes the influence of small and large municipalities — and the effect vanishes.

                    This tells us the effect is about absolute crime counts shifting in specific large municipalities, not about crime rates proportionally changing across all income groups. Keep this distinction in mind when interpreting your results and discussing limitations.

                    Task 12.2: Assault Crimes Only¶

                    📝 Task
                    Test whether the differential effects hold when focusing on assault crimes only. Huck (2024) examines violent crime, and assault is the closest measure we have in the Netherlands data. Use the same TWFE specification with reghdfe, absorb(muni_id year_month), and vce(cluster muni_id). Use assault_crimes as the dependent variable instead of total_crimes. Store the estimates as m5_assault. This tests whether the psychological mechanism primarily affects violent vs property crime.
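One way to implement this task (a sketch; only the dependent variable changes relative to the main TWFE model):

```stata
* Robustness check 2: violent crime only (assault as dependent variable)
summarize assault_crimes

reghdfe assault_crimes ret_x_high ret_x_low, ///
    absorb(muni_id year_month) vce(cluster muni_id)
eststo m5_assault
```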
                    ========================================
                    Robustness Check 2: Assault Crimes
                    ========================================
                    
                    
                        Variable |        Obs        Mean    Std. dev.       Min        Max
                    -------------+---------------------------------------------------------
                    assault_cr~s |     36,396    11.27756    28.59991          0        463
                    (MWFE estimator converged in 2 iterations)
                    
                    HDFE Linear regression                            Number of obs   =     36,396
                    Absorbing 2 HDFE groups                           F(   2,    336) =       1.41
                    Statistics robust to heteroskedasticity           Prob > F        =     0.2466
                                                                      R-squared       =     0.9677
                                                                      Adj R-squared   =     0.9672
                                                                      Within R-sq.    =     0.0001
                    Number of clusters (muni_id) =        337         Root MSE        =     5.1757
                    
                                                  (Std. err. adjusted for 337 clusters in muni_id)
                    ------------------------------------------------------------------------------
                                 |               Robust
                    assault_cr~s | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                    -------------+----------------------------------------------------------------
                      ret_x_high |  -.0009849   .0071113    -0.14   0.890    -.0149731    .0130033
                       ret_x_low |    .026294   .0166023     1.58   0.114    -.0063635    .0589514
                           _cons |   11.27207   .0042482  2653.40   0.000     11.26371    11.28042
                    ------------------------------------------------------------------------------
                    
                    Absorbed degrees of freedom:
                    -----------------------------------------------------+
                     Absorbed FE | Categories  - Redundant  = Num. Coefs |
                    -------------+---------------------------------------|
                         muni_id |       337         337           0    *|
                      year_month |       108           1         107     |
                    -----------------------------------------------------+
                    * = FE nested within cluster; treated as redundant for DoF computation
                    
                    ---- CHECKPOINT: Assault crimes specification ----
                    
                    Question: Does focusing on violent crime change results?
                     High-income effect: -.00098488
                     Low-income effect: .02629398
                    
                    ✓ Pattern holds for violent crime (assault)
                    
                    ✅ Task 12.2 tests passed
                    

                    Task 12.3: Volatility Instead of Returns¶

                    📝 Task
                    Test an alternative mechanism: Does market uncertainty (volatility) affect crime, regardless of return direction? Create interaction terms between volatility and income terciles, then estimate the model. Name the interaction variables `vol_high` (`volatility * high_income`) and `vol_low` (`volatility * low_income`). Use the same TWFE specification: reghdfe total_crimes vol_high vol_low, absorb(muni_id year_month) vce(cluster muni_id). Store the estimates as m6_volatility. This tests whether psychological effects come from uncertainty rather than wealth changes.
                    ℹ️ Returns vs Volatility
                    Two psychological channels:
                    • Returns (direction): Investors feel richer when market rises → better mood → less crime
                    • Volatility (uncertainty): Market turbulence creates stress/anxiety → worse mood → more crime

                    Expected pattern if volatility matters:
                    Higher volatility → more crime in high-income areas (investors stressed by uncertainty)

                    Variable used: volatility = within-month SD of daily returns (calculated in Section 3)
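Putting the task together, a sketch of the volatility specification (assuming `high_income` and `low_income` are the tercile indicators created earlier):

```stata
* Robustness check 3: volatility channel (uncertainty rather than direction)
gen vol_high = volatility * high_income
gen vol_low  = volatility * low_income
summarize volatility vol_high vol_low

reghdfe total_crimes vol_high vol_low, ///
    absorb(muni_id year_month) vce(cluster muni_id)
eststo m6_volatility
```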

                        Variable |        Obs        Mean    Std. dev.       Min        Max
                    -------------+---------------------------------------------------------
                      volatility |     36,396    .0094554    .0050592    .003573   .0422672
                        vol_high |     36,396    .0031425     .005324          0   .0422672
                         vol_low |     36,396    .0031705    .0053394          0   .0422672
                    (MWFE estimator converged in 2 iterations)
                    
                    HDFE Linear regression                            Number of obs   =     36,396
                    Absorbing 2 HDFE groups                           F(   2,    336) =       4.31
                    Statistics robust to heteroskedasticity           Prob > F        =     0.0142
                                                                      R-squared       =     0.9805
                                                                      Adj R-squared   =     0.9802
                                                                      Within R-sq.    =     0.0006
                    Number of clusters (muni_id) =        337         Root MSE        =    79.9474
                    
                                                  (Std. err. adjusted for 337 clusters in muni_id)
                    ------------------------------------------------------------------------------
                                 |               Robust
                    total_crimes | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                    -------------+----------------------------------------------------------------
                        vol_high |   56.42181   51.53063     1.09   0.274    -44.94149    157.7851
                         vol_low |  -769.2239   298.4051    -2.58   0.010    -1356.201   -182.2463
                           _cons |   225.9745   .9781892   231.01   0.000     224.0504    227.8987
                    ------------------------------------------------------------------------------
                    
                    Absorbed degrees of freedom:
                    -----------------------------------------------------+
                     Absorbed FE | Categories  - Redundant  = Num. Coefs |
                    -------------+---------------------------------------|
                         muni_id |       337         337           0    *|
                      year_month |       108           1         107     |
                    -----------------------------------------------------+
                    * = FE nested within cluster; treated as redundant for DoF computation
                    
                    ---- CHECKPOINT: Volatility specification ----
                    
                    Question: Does market uncertainty (not direction) affect crime?
                     High-income: 56.42181
                     Low-income: -769.2239
                    
                    High volatility → more crime in high-income areas (stress channel)
                    
                    ✅ Task 12.3 tests passed
                    
                    🚀 Interpreting Robustness Checks
                    What to conclude from robustness checks:

                    If results are similar across specifications:
                    → Findings are robust; not driven by specific modeling choices
                    → Confidence in main result increases

                    If results differ substantially:
                    → Effects may be heterogeneous (e.g., only for violent crime)
                    → Or main specification had issues (e.g., skewed outcome needed logs)
                    → Discuss why differences arise and which specification is preferred

                    If volatility matters more than returns:
                    → Uncertainty/stress channel dominates wealth effect
                    → Challenges Huck's specific mechanism

                    Common practice: Present main result (TWFE with levels) in text, show robustness in appendix table. If robustness checks reveal important heterogeneity, discuss in main text.


                    Section 13: Regression Table¶

                    We now create a publication-quality regression table combining all our main results. This table summarizes the entire empirical analysis in a format suitable for a thesis or research paper.

                    Good regression tables should:

                    1. Show coefficients with standard errors (in parentheses)
                    2. Use stars to indicate significance levels (*** p<0.01, ** p<0.05, * p<0.10)
                    3. Report key model statistics (N, R², fixed effects)
                    4. Have clear column headers describing each specification
                    5. Be formatted for easy comparison across models

                    We will create a table with the following models:

                    • Column 1: Compound baseline (compound returns, one-way FE) — naive benchmark
                    • Column 2: Standardized baseline (standardized returns, one-way FE)
                    • Column 3: Interactions (standardized returns, one-way FE)
                    • Column 4: TWFE (standardized returns, municipality + time FE)
                    • Column 5: Log specification (TWFE)
                    • Column 6: Assault crimes (TWFE)

                    Column 1 uses raw compound returns as a naive benchmark. Columns 2–6 use standardized returns following Huck (2024), which account for time-varying volatility.

                    This table will be exported as LaTeX for inclusion in reports.

                    Task 13.1: Create Publication-Quality Regression Table¶

                    📝 Task
                    Use esttab to combine all stored estimates into a professional table. Export to both screen display (for review) and LaTeX format (for thesis). Include coefficient estimates, standard errors, significance stars, and model statistics.
                    Stata Stata Tip: Creating Professional Tables with esttab
                    Key options for esttab:
                    • se(3) — show standard errors with 3 decimals
                    • star(* 0.10 ** 0.05 *** 0.01) — significance stars
                    • mtitles("Model 1" "Model 2" ...) — column headers
                    • keep(varlist) — which variables to show
                    • order(varlist) — row order
                    • stats(N r2_a, labels("Observations" "Adjusted R²")) — model statistics
                    • label — use variable labels instead of names
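Putting these options together, a sketch of the `esttab` call. The names for the first four stored models (`m1_compound`, `m2_std`, `m3_interactions`, `m4_twfe`) are placeholders — use whatever names you stored in Sections 9–11; `m4_log` and `m5_assault` come from Tasks 12.1 and 12.2, and `$tables` is the output-path global defined in the setup:

```stata
* Combine stored models into one table and export to LaTeX
esttab m1_compound m2_std m3_interactions m4_twfe m4_log m5_assault ///
    using "$tables/main_results.tex", replace ///
    se(3) star(* 0.10 ** 0.05 *** 0.01) label ///
    mtitles("Compound" "Std. Baseline" "Interactions" "TWFE" "Log" "Assault") ///
    stats(N r2_a, labels("Observations" "Adjusted R²")) ///
    title("Table 1: Effect of Stock Returns on Crime by Income Group")
```

To review the table on screen first, run the same `esttab` command without the `using` clause.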
                    ========================================
                    Main Regression Results Table
                    ========================================
                    
                    
                     Table 1: Effect of Stock Returns on Crime by Income Group
                     --------------------------------------------------------------------------------------------------------------------
                                              Compound    Std. Basel~e    Interactions            TWFE             Log         Assault
                     --------------------------------------------------------------------------------------------------------------------
                     Monthly compound r~n        33.72***
                                               (5.596)
                     
                     Monthly standardiz~m                        0.665***
                                                               (0.101)
                     
                     Standardized retur~e                                        1.307***        0.867***      -0.000392          0.0263
                                                                               (0.284)         (0.291)         (0.000)         (0.017)
                     
                     Standardized retur~c                                        0.440***
                                                                               (0.060)
                     
                     Standardized retur~m                                        0.243***       -0.197***       0.000119       -0.000985
                                                                               (0.025)         (0.065)         (0.001)         (0.007)
                     --------------------------------------------------------------------------------------------------------------------
                     Observations               36,396          36,396          36,396          36,396          36,396          36,396
                     Adjusted R²                 0.000           0.001           0.001           0.980           0.967           0.967
                     --------------------------------------------------------------------------------------------------------------------
                     Standard errors in parentheses
                     * p<0.10, ** p<0.05, *** p<0.01
                    
                     Notes for table:
                      • Dependent variable: Total crimes (Columns 1-4), Log(total crimes) (Column 5), Assault crimes (Column 6)
                      • Column 1 uses compound returns; Columns 2-6 use standardized returns (Huck 2024)
                      • All models include municipality fixed effects
                      • Columns 4-6 include time (year-month) fixed effects
                      • Standard errors clustered at municipality level in parentheses
                      • * p<0.10, ** p<0.05, *** p<0.01
                     (output written to /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/06-assignment/output/tables/main_results.tex)
                    
                    ---- CHECKPOINT: Regression table created ----
                     ✅ Table exported to: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/06-assignment/output/tables/main_results.tex
                     Models included: 6 (Compound baseline through Assault)
                    
                    ✅ Task 13.1 tests passed
                    
                    🚀 Above and Beyond: Volatility Results Table
                    The volatility specification (m6) tests a different mechanism and uses different independent variables. It shouldn't be in the main table but could go in an appendix or supplementary table.

                    If you want to create a separate volatility table, use:
                    esttab m6_volatility using "$tables/volatility_results.tex"

                    This keeps your main results table clean and focused on the return → crime relationship.

                    ℹ️ Reading the Table
                    How to interpret the table:

                    Column progression (left to right):

                    1. Compound: Naive baseline using raw compound returns (no volatility adjustment)
                    2. Std. Baseline: Average effect using standardized returns (Huck's approach)
                    3. Interactions: Effects by income group (key test!)
                    4. TWFE: Same as (3) but with time FE (preferred specification)
                    5. Log: Percentage changes instead of levels
                    6. Assault: Violent crime only

                      What to look for:

                      • Column 1 vs 2: Does volatility standardization matter? Compare the baseline coefficients.
                      • Consistency across columns 3-6: Do interaction coefficients have same signs?
                      • Statistical significance: How many stars (*** = strong, ** = moderate, * = weak)?
                      • Economic magnitude: Are effects large enough to matter?
                      • R² changes: Does adding interactions improve fit?

                        For your thesis: This table would go in the main body. The volatility table (if created) would go in an appendix. Always accompany tables with 2-3 paragraphs of interpretation in text.


                        Section 14: Interpretation and Limitations (Ungraded and voluntary)¶

                        The empirical analysis is complete. Now we interpret the findings and honestly discuss limitations.

                        Good empirical papers have three key elements:

                        1. Clear interpretation of what the results mean economically and statistically
                        2. Honest discussion of limitations (what the study cannot test or conclude)
                        3. Constructive suggestions for how future work could improve identification

                        Why limitations matter:

                        • Prevents over-claiming (saying more than the data can support)
                        • Shows you understand the research design deeply
                        • Guides readers on how to interpret your findings
                        • In your thesis, examiners expect thoughtful discussion of what you cannot claim

                        This section has three writing tasks that synthesize everything you've learned.

                        Task 14.1: Summarize Your Findings¶

                        📝 Task
                        Write a summary of your main findings (minimum 200 words). Address the following questions:
                        1. Main result: Do you find differential effects of returns across income groups?
                        2. Signs: Are the signs consistent with the theory in Huck (2024)? (High-income negative, low-income positive?)
                        3. Statistical significance: Are the key coefficients statistically significant?
                        4. Economic magnitude: How large are the effects? (e.g., "A 1 SD positive return reduces crime by X incidents in high-income areas")
                        5. Robustness: Do results hold across specifications (TWFE, log, assault)?
                        6. Comparison to Huck: Are your findings consistent with or different from the US results?

                        Your findings summary (minimum 200 words):

                        (Write your answer below)

                        Task 14.2: Discuss Limitations¶

                        📝 Task
                        Discuss at least three substantive limitations of this analysis (minimum 150 words). For each limitation:
                        • Explain what the limitation is
                        • Why it matters (what it prevents you from concluding)
                        • How it differs from the original Huck (2024) study

                        Suggested limitations to consider:

                        1. Monthly vs daily frequency: Cannot test contemporaneous/hourly timing effects
                        2. Single market return: No cross-sectional variation in "treatment" within months
                        3. Income as proxy: Municipality income ≠ individual investor status (ecological fallacy)
                        4. Standard error concerns: Few time periods, spatial correlation across municipalities
                        5. Crime reporting differences: Measurement error varies by municipality/crime type
                        6. Selection/attrition: Municipality boundaries changed over time (mergers)

                          ⚠️ Important: Be Honest
                          Good researchers are honest about limitations. This doesn't weaken your work—it shows you understand the research design deeply.

                          What NOT to do:

                          • ❌ "We don't have enough data" (too vague)
                          • ❌ "Results might be wrong" (undermines everything)
                          • ❌ List minor issues without explaining why they matter

                            What TO do:

                            • ✓ "Monthly data cannot test within-day timing effects (Huck's Table 5), limiting our ability to distinguish same-day from lagged effects"
                            • ✓ "Municipality-level income is an imperfect proxy for investor participation; Huck uses actual brokerage data, providing more precise treatment assignment"
                            • ✓ Explain the consequence of each limitation

                              Your limitations discussion (minimum 150 words, at least 3 points):

                              (Write your answer below)

                              Limitation 1:

                              Limitation 2:

                              Limitation 3:

                              Task 14.3: What Would Strengthen This Analysis?¶

                              📝 Task
                              Suggest at least three improvements that would strengthen the identification or generalizability of this analysis (minimum 100 words). Think creatively about:
                              • Better data: What data would you ideally have?
                              • Natural experiments: Are there shocks you could exploit for causal identification?
                              • Alternative mechanisms: What else could you test?
                              • Extensions: How could you expand this to related questions?

                              Examples to consider:

                              • Daily crime data (if it existed in NL)
                              • Individual investor data (brokerage accounts by municipality)
                              • Natural experiment: Neo-broker entry (Trade Republic, DEGIRO expansion) for DiD
                              • Cross-country comparison (UK has better crime data?)
                              • Alternative outcomes: Health, well-being, Google searches
                              • Crypto markets (more volatile, younger investors)

                                Your suggestions for improvement (minimum 100 words, at least 3 ideas):

                                (Write your answer below)

                                Improvement 1:

                                Improvement 2:

                                Improvement 3:


                                Part B Summary¶

                                Congratulations! You have completed Part B: Analysis.

                                What you accomplished:

                                1. ✅ Section 7: Created descriptive statistics by income tercile
                                2. ✅ Section 8: Visualized crime and return trends over time
                                3. ✅ Section 9: Estimated baseline panel regressions
                                4. ✅ Section 10: Tested differential effects by income (core hypothesis)
                                5. ✅ Section 11: Added two-way fixed effects for better identification
                                6. ✅ Section 12: Conducted robustness checks (log, assault, volatility)
                                7. ✅ Section 13: Created publication-quality regression table
                                8. ✅ Section 14: Interpreted findings and discussed limitations honestly

                                Outputs created:

                                • Tables: summary_stats_panelA.tex, summary_stats_panelB.tex, main_results.tex
                                • Figures: crime_timeseries.png, crime_returns_overlay.png, crime_by_income.png
                                • Stored estimates: 6 regression models ready for comparison

                                Skills practiced:

                                • Panel data regression with fixed effects
                                • Testing interaction effects
                                • Two-way fixed effects estimation
                                • Robustness checks
                                • Creating professional tables and figures
                                • Interpreting results and discussing limitations

                                Your thesis Module 2/3 will look very similar:

                                • Part A (Data Preparation): 60-70% of your time
                                • Part B (Analysis): 20-30% of your time
                                • Part C (Writing/Interpretation): 10% of your time

                                The skills you practiced here—data wrangling, regression analysis, robustness checks, honest limitation discussion—are exactly what you'll need for your MSc thesis.


                                Part C: Thesis Connection and Reflection¶

                                This final section connects your work to the MSc thesis and reflects on what you've learned.

                                Section 15: Thesis Module 2/3 Connection¶

                                Your MSc Finance thesis has three modules:

                                • Module 1: Literature review and research question
                                • Module 2: Replication of published results with new/similar data
                                • Module 3: Extension or new analysis building on Module 2

                                This assignment mirrors the Module 2/3 workflow. You've essentially completed a mini-thesis:

                                • Read a published paper (Huck 2024)
                                • Replicated the methodology with new data (Netherlands instead of US)
                                • Tested whether findings generalize to a new setting

                                Section 16: Conclusion¶

                                You have completed a full replication/extension exercise from start to finish.

                                What you accomplished:

                                Part A (Data Preparation):

                                • Loaded and examined three raw datasets
                                • Calculated monthly returns following published methodology
                                • Reshaped data from long to wide format
                                • Merged multiple datasets into a panel
                                • Created analysis variables (terciles, interactions)
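The Part A pipeline above can be sketched in a few Stata lines. This is a minimal illustration, not the assignment's actual code: the file names, variable names, and globals (`crime_long.dta`, `returns_monthly.dta`, `crime_count`, `income`, `aex_return`) are placeholders for whatever your own datasets use.

```stata
* Sketch of the Part A workflow (file and variable names are illustrative)
use "$raw/crime_long.dta", clear

* Long -> wide: one row per municipality-month, one column per crime type
reshape wide crime_count, i(municipality month) j(crime_type) string

* Merge in monthly market returns and municipality-level income
merge m:1 month using "$processed/returns_monthly.dta", keep(match) nogenerate
merge m:1 municipality using "$raw/income.dta", keep(match) nogenerate

* Derived variables: income terciles and a return x low-income interaction
xtile income_tercile = income, nquantiles(3)
gen ret_x_lowinc = aex_return * (income_tercile == 1)
```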

                                Part B (Analysis):

                                • Created descriptive statistics and visualizations
                                • Estimated baseline and interaction regressions
                                • Tested differential effects across income groups (Huck's hypothesis)
                                • Added two-way fixed effects for better identification
                                • Conducted robustness checks
                                • Created publication-quality tables
                                • Interpreted results and discussed limitations honestly
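The core Part B estimation can likewise be sketched in Stata. Again, variable names (`municipality_id`, `crime_total`, `aex_return`, `income_tercile`) are illustrative assumptions; the pattern shown is a two-way fixed effects interaction model with standard errors clustered at the municipality level.

```stata
* Sketch of the Part B estimation (variable names are illustrative)
xtset municipality_id month

* Municipality FE via xtreg, month FE via dummies, interaction with income terciles
xtreg crime_total c.aex_return##i.income_tercile i.month, fe vce(cluster municipality_id)
```

Clustering at the municipality level allows for arbitrary serial correlation within each municipality, which is the standard choice when the treatment (market returns) varies over time but outcomes are persistent within units.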

                                Key lessons:

                                1. Data preparation is 60-80% of empirical work — You spent most time in Part A, not Part B. This is normal and expected.

2. Published papers use better data than you'll have — Huck (2024) works with daily crime data and actual investor accounts; you have monthly aggregates and income proxies. This is the reality of replication work.

                                3. Honest limitation discussion is a strength — Acknowledging what you cannot test shows deep understanding of research design.

                                4. Even null results are informative — If you don't find Huck's pattern in NL data, that's a contribution (generalizability is limited).

                                5. Extensions matter more than perfect replication — Your thesis should test something new, even if it's a simple extension.

                                For your MSc thesis:

                                • Start data preparation early (it always takes longer than expected)
                                • Be realistic about what you can test with available data
                                • Focus on a simple, clear extension rather than perfect replication
                                • Discuss limitations honestly (examiners value this highly)

                                Final thought:

                                Research is about answering questions with imperfect data and honest acknowledgment of limitations. You've practiced exactly that in this assignment. These skills—data wrangling, regression analysis, robustness testing, limitation discussion—will serve you well in your thesis and beyond.

                                Good luck with your MSc thesis! 🎓


                                Data Analytics for Finance

                                BM17FI · Academic Year 2025–26

                                Erasmus University Rotterdam

                                Created by: Caspar David Peter

                                © 2026 Rotterdam School of Management