Data Analytics for Finance
BM17FI · Rotterdam School of Management
Replication Exercise
Learning Objectives¶
This assignment prepares you for MSc thesis Module 2/3 (replication and extension) by having you:
- Read and understand methodology from a published empirical finance paper
- Build an analysis panel from multiple raw data sources (merge, reshape)
- Calculate returns following published methodology (compound, standardized)
- Create derived variables (terciles, interactions)
- Work with panel data at geographic (municipality) level
- Estimate fixed effects regressions with appropriate standard errors
- Interpret results and discuss limitations honestly
- Understand the gap between "ideal" identification and feasible tests
Concretely, you will:
- Load and examine three raw datasets (crime, income, stock returns)
- Calculate monthly stock returns following published methodology
- Reshape crime data from long to wide format
- Merge datasets to create a municipality-month panel
- Create analysis variables (income terciles, interactions)
- Estimate panel regressions testing differential effects
- Interpret findings and discuss limitations
- Reflect on thesis Module 2/3 connection
Setup¶
Environment Setup¶
We begin by clearing the environment and preparing Stata for analysis.
✅ Graph scheme set to stcolor. ✅ The environment is cleared and ready.
Set File Paths¶
We define global macros for all file paths to keep code clean and portable.
/Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/06-assignment
📁 Base directory: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/06-assignment
📁 Raw data folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/06-assignment/data/raw
📁 Processed data folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/06-assignment/data/processed
📁 Output directory: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/06-assignment/output
📁 Tables folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/06-assignment/output/tables
📁 Figures folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/06-assignment/output/figures
Part A: Data Preparation¶
Section 1: Introduction and Paper Context¶
Research Question¶
This assignment is based on Huck (2024) "The Psychological Externalities of Investing: Evidence from Stock Returns and Crime" published in The Review of Financial Studies.
Key finding from Huck (2024):
- When stock markets rise, violent crime decreases in high-income areas (where people are more likely to be investors)
- When stock markets rise, violent crime increases in low-income areas (where people are less likely to be investors)
- This suggests that relative wealth/status affects psychological well-being and behavior
What We're Testing¶
We test whether this differential relationship holds in the Netherlands using:
- Monthly crime data from CBS/Politie (~350 municipalities, 2012-2019)
- Monthly AEX index returns
- Cross-sectional income data by municipality
Why This Assignment Matters for Your Thesis¶
Published research often starts with an observation in one setting. A valuable contribution is testing whether findings generalize to new contexts. Even null results are informative.
In your thesis, you will spend 60-80% of your time preparing data. This assignment gives you practice with the exact operations you'll need:
- Merging datasets from different sources
- Calculating returns properly (following published methodology)
- Reshaping data
- Creating analysis variables
Assignment Roadmap¶
Part A: Data Preparation
- Load raw data (crime, income, AEX prices)
- Calculate monthly returns (compound and standardized)
- Reshape crime data (long → wide)
- Merge datasets into a municipality-month panel
- Create analysis variables (terciles, interactions)
Standardization matters because a 2% return during low volatility is more "newsworthy" than 2% during a crisis. A standardized return of 2 means the market moved 2 standard deviations—regardless of the volatility regime.
For monthly analysis, we aggregate standardized daily returns. This preserves the "surprise" interpretation while matching our crime data frequency.
We are testing a weaker hypothesis: whether monthly returns correlate with monthly crime differently across income groups.
This is a common constraint in replication work—data availability often limits what can be tested. Your thesis will face similar challenges.
Section 2: Load and Examine Raw Data¶
We have three raw datasets to work with:
- crime.dta: Monthly crime counts by municipality (long format by crime type)
- income_data.dta: Average median income by municipality (cross-section)
- aex_daily.dta: Daily AEX index prices (we'll compute returns)
Let's load and examine each dataset to understand its structure.
Task 2.1: Load and Examine Crime Data¶
Load $raw/crime.dta and examine its structure. The data should contain monthly crime counts for each municipality and crime type. Use describe to see variables, list to see the first few observations, and summarize to get basic statistics.
- use "$raw/crime.dta", clear — load the dataset
- describe — see variable names and types
- list in 1/10 — view the first 10 rows
- tab crime_type — see what crime types are included
- summarize reported_crimes — get basic statistics
Contains data from /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/06-assignment/data/raw/crime.dta
Observations: 172,344
Variables: 5 06 Jan 2026 22:55
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
crime_type str21 %-9s Type of Crime
municipality~de str6 %-9s Municipality Code
municipality~me str29 %-9s Municipality Name
reported_crimes long %12.0g
date double %td Date (Year-Month)
-------------------------------------------------------------------------------
Sorted by:
+-------------------------------------------------------------------+
| crime_type munic~de municipa~me report~s date |
|-------------------------------------------------------------------|
1. | Totaal misdrijven GM1680 Aa en Hunze 84 01jan2012 |
2. | Totaal misdrijven GM1680 Aa en Hunze 103 01feb2012 |
3. | Totaal misdrijven GM1680 Aa en Hunze 101 01mar2012 |
4. | Totaal misdrijven GM1680 Aa en Hunze 71 01apr2012 |
5. | Totaal misdrijven GM1680 Aa en Hunze 78 01may2012 |
|-------------------------------------------------------------------|
6. | Totaal misdrijven GM1680 Aa en Hunze 50 01jun2012 |
7. | Totaal misdrijven GM1680 Aa en Hunze 69 01jul2012 |
8. | Totaal misdrijven GM1680 Aa en Hunze 99 01aug2012 |
9. | Totaal misdrijven GM1680 Aa en Hunze 73 01sep2012 |
10. | Totaal misdrijven GM1680 Aa en Hunze 77 01oct2012 |
+-------------------------------------------------------------------+
Type of Crime | Freq. Percent Cum.
----------------------+-----------------------------------
1.4.2 Moord, doodslag | 57,448 33.33 33.33
1.4.5 Mishandeling | 57,448 33.33 66.67
Totaal misdrijven | 57,448 33.33 100.00
----------------------+-----------------------------------
Total | 172,344 100.00
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
reported_c~s | 126,324 101.4997 377.645 1 9239
---- CHECKPOINT: Crime data loaded ----
Observations: 172344
Variables: 5
✅ Task 2.1 tests passed
Task 2.2: Load and Examine Income Data¶
Load $raw/income_data.dta and examine its structure. This is a cross-sectional dataset with one observation per municipality containing average median standardized income for 2011-2018.
Contains data from /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/06-assignment/data/raw/income_data.dta
Observations: 355
Variables: 2 06 Jan 2026 22:55
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
municipality_~e str6 %-9s Municipality Code
avg_median_st~e double %10.0g Average Median Standardized
Income
-------------------------------------------------------------------------------
Sorted by:
+----------------------+
| munici~e avg_med~e |
|----------------------|
1. | GM0003 20.606186 |
2. | GM0010 21.084211 |
3. | GM0014 19.855506 |
4. | GM0024 22.255462 |
5. | GM0034 22.317833 |
|----------------------|
6. | GM0037 20.437981 |
7. | GM0047 20.88774 |
8. | GM0050 23.638929 |
9. | GM0059 20.745074 |
10. | GM0060 23.382847 |
+----------------------+
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
avg_median~e | 355 23.53126 1.867351 19.83227 36.45822
---- CHECKPOINT: Income data loaded ----
Municipalities: 355
✅ Task 2.2 tests passed
Task 2.3: Load and Examine AEX Daily Price Data¶
Load $raw/aex_daily_data.dta and examine its structure. This dataset contains daily closing prices for the AEX index. We will calculate returns from these prices in the next section.
Contains data from /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/06-assignment/data/raw/aex_daily_data.dta
Observations: 2,609
Variables: 2 06 Jan 2026 22:55
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
date double %td Date
price double %10.0g AEX Index Closing Price
-------------------------------------------------------------------------------
Sorted by:
+----------------------+
| date price |
|----------------------|
1. | 03jan2011 359.8629 |
2. | 04jan2011 358.8617 |
3. | 05jan2011 357.2818 |
4. | 06jan2011 356.8858 |
5. | 07jan2011 356.4425 |
|----------------------|
6. | 10jan2011 354.2092 |
7. | 11jan2011 358.303 |
8. | 12jan2011 362.4137 |
9. | 13jan2011 360.7873 |
10. | 14jan2011 361.3202 |
+----------------------+
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 2,609 451.2762 92.43622 263.4371 629.2317
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
date | 2,609 20454.8 1054.616 18630 22280
---- CHECKPOINT: AEX data loaded ----
Trading days: 2609
✅ Task 2.3 tests passed
Section 3: Calculate AEX Returns (Following Huck 2024 Methodology)¶
Following Huck (2024), we need to calculate standardized returns. This involves:
- Calculate daily returns
- Calculate trailing 252-day standard deviation (one trading year)
- Standardize daily returns by dividing by trailing SD
- Aggregate to monthly frequency
We'll also calculate compound monthly returns for comparison.
Task 3.1: Calculate Daily Returns¶
- Simple returns: ret_daily = (price_t / price_{t-1}) - 1
- Log returns: ret_log = ln(price_t / price_{t-1})
Commands:
- sort date — sort data chronologically
- gen ret_daily = (price / price[_n-1]) - 1 — simple return
- gen ret_log = ln(price / price[_n-1]) — log return
- summarize ret_daily, detail — check for a reasonable range
(1 missing value generated)
(1 missing value generated)
Daily simple return
-------------------------------------------------------------
Percentiles Smallest
1% -.0323815 -.1075333
5% -.0172826 -.0764755
10% -.0116792 -.05704 Obs 2,608
25% -.0047384 -.052363 Sum of wgt. 2,608
50% .0004368 Mean .0002727
Largest Std. dev. .0110496
75% .0058393 .0421586
90% .0118066 .0452309 Variance .0001221
95% .0171069 .0461471 Skewness -.5256861
99% .0295844 .0897213 Kurtosis 10.82545
Non-missing returns: 2608
Mean daily return: 0.0003
Std dev: 0.0110
✅ Task 3.1 tests passed
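The notebook itself runs Stata; for intuition, the same two return calculations can be sketched in pandas, using made-up prices:

```python
# Illustrative pandas sketch of Task 3.1 (the notebook uses Stata).
# The prices are invented; only the formulas mirror the Stata code.
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0, 103.0])

ret_daily = prices / prices.shift(1) - 1    # simple return: p_t / p_{t-1} - 1
ret_log = np.log(prices / prices.shift(1))  # log return: ln(p_t / p_{t-1})

# The first observation has no lag, so both series start with a missing value,
# matching Stata's "(1 missing value generated)" message.
print(ret_daily.round(4).tolist())
```

For small returns the two measures are close (ln(1.02) ≈ 0.0198 vs 0.02), which is why either is acceptable for daily data while log returns are more convenient for compounding.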
Task 3.2: Calculate Trailing 252-Day Standard Deviation¶
We use the rangestat package to calculate rolling statistics efficiently.
The rangestat command calculates statistics over a moving window:
rangestat (sd) trailing_sd = ret_daily, interval(date -252 -1)
- The interval -252 -1 means "from 252 days ago to 1 day ago" (excludes today)
- The first ~252 observations will have missing trailing_sd (not enough history)
Trailing 252-day SD of returns
-------------------------------------------------------------
Percentiles Smallest
1% .0052468 .0011458
5% .0058144 .0015453
10% .0067857 .0016472 Obs 2,606
25% .0075677 .0021919 Sum of wgt. 2,606
50% .0084327 Mean .0102832
Largest Std. dev. .0038261
75% .0130778 .0203429
90% .0162531 .0203489 Variance .0000146
95% .0187518 .0203502 Skewness .9560908
99% .020195 .0203506 Kurtosis 2.984837
Missing SD values (first ~252 days): 3
Mean trailing SD: 0.0103
✅ Task 3.2 tests passed
Task 3.3: Calculate Standardized Daily Returns¶
Divide each daily return by its trailing 252-day standard deviation. Name the new variable ret_std_daily.
(3 missing values generated)
Standardized daily return (ret/SD)
-------------------------------------------------------------
Percentiles Smallest
1% -3.347214 -9.463235
5% -1.738406 -7.754665
10% -1.204699 -5.648883 Obs 2,606
25% -.5011903 -4.94842 Sum of wgt. 2,606
50% .039744 Mean .0189555
Largest Std. dev. 1.077812
75% .6185057 3.729102
90% 1.172071 4.049637 Variance 1.161679
95% 1.639769 5.27277 Skewness -.7527249
99% 2.53291 5.910226 Kurtosis 8.611722
Non-missing standardized returns: 2606
Mean: 0.0190
Std dev: 1.0778
✅ Task 3.3 tests passed
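For intuition, Tasks 3.2-3.3 can be sketched in pandas. Note the simplifications: simulated returns, and an observation-count window rather than rangestat's calendar-day interval, so this is purely illustrative:

```python
# Illustrative pandas sketch of trailing-SD standardization (the notebook
# uses Stata's rangestat with a calendar-day interval; here we use a
# 252-observation window on simulated returns).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ret_daily = pd.Series(rng.normal(0.0003, 0.011, size=400))  # simulated daily returns

# .shift(1) makes the 252-observation window end one day before the current
# observation, mirroring interval(date -252 -1): today's return is excluded,
# so each return is scaled only by past volatility.
trailing_sd = ret_daily.rolling(window=252).std().shift(1)
ret_std_daily = ret_daily / trailing_sd

# No full trailing window exists for the first 252 observations
print(int(ret_std_daily.isna().sum()))  # 252
```

Excluding today from the window matters: including it would let a large shock inflate its own denominator and shrink its standardized value.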
Task 3.4: Create Year-Month Variable¶
Use the mofd() function to convert daily dates to monthly format.
- gen year_month = mofd(date) — convert the daily date to monthly
- format year_month %tm — format as YYYY-MM
✅ Task 3.4 tests passed
Task 3.5: Aggregate to Monthly Frequency¶
- Compound return: exp(sum of log returns) - 1
- Standardized monthly return: sum of standardized daily returns
- Volatility: standard deviation of daily returns
- Trading days: count of observations per month
- bysort year_month: egen ret_log_sum = total(ret_log)
- gen ret_compound = exp(ret_log_sum) - 1
- bysort year_month: egen ret_std_monthly = total(ret_std_daily)
- collapse (first) ret_compound ret_std_monthly (sd) volatility = ret_daily (count) n_trading_days = ret_daily, by(year_month)
- gen date = dofm(year_month) — create a daily date variable for later merging with crime data
- format date %td — apply the date format
+--------------------------------------------------------------------+
| year_m~h ret_com~d ret_std~y volati~y n_trad~s date |
|--------------------------------------------------------------------|
1. | 2011m1 .0024548 .5534533 .007666 20 01jan2011 |
2. | 2011m2 .0232263 2.948537 .0068946 20 01feb2011 |
3. | 2011m3 -.0094981 -2.251385 .0097187 23 01mar2011 |
4. | 2011m4 -.0155369 -1.979866 .0076843 21 01apr2011 |
5. | 2011m5 -.0291778 -3.587304 .0083145 22 01may2011 |
|--------------------------------------------------------------------|
6. | 2011m6 -.028016 -3.423634 .0093389 22 01jun2011 |
7. | 2011m7 -.0308594 -3.683877 .010141 21 01jul2011 |
8. | 2011m8 -.1100779 -13.34819 .0223073 23 01aug2011 |
9. | 2011m9 -.0435389 -3.706579 .023562 22 01sep2011 |
10. | 2011m10 .0975143 6.884334 .0174383 21 01oct2011 |
|--------------------------------------------------------------------|
11. | 2011m11 -.0254433 -1.531934 .0208009 22 01nov2011 |
12. | 2011m12 .0427059 2.665466 .0124832 22 01dec2011 |
+--------------------------------------------------------------------+
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
year_month | 120 671.5 34.78505 612 731
ret_compound | 120 .0054344 .0408573 -.1100779 .1351352
ret_std_mo~y | 120 .4116507 4.289544 -13.34819 8.55268
volatility | 120 .0098128 .0052895 .003573 .0422672
n_trading_~s | 120 21.73333 .9850422 20 23
-------------+---------------------------------------------------------
date | 120 20438.45 1058.793 18628 22250
---- CHECKPOINT: Monthly aggregation complete ----
Months: 120
Mean monthly return: 0.0054
SD of monthly returns: 0.0409
✅ Task 3.5 tests passed
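The compound-return step relies on an identity worth verifying: exponentiating the sum of log returns reproduces the product of gross simple returns. A tiny pure-Python check with made-up numbers:

```python
# Verify exp(sum of log returns) - 1 == compounded simple return
# (the identity behind Task 3.5; the daily returns are invented).
import math

simple_returns = [0.01, -0.02, 0.015]  # three made-up daily returns in one month

log_sum = sum(math.log(1 + r) for r in simple_returns)
ret_compound = math.exp(log_sum) - 1   # the collapse-based route in the notebook

direct = 1.0
for r in simple_returns:
    direct *= (1 + r)                  # direct compounding
direct -= 1

assert abs(ret_compound - direct) < 1e-12
print(round(ret_compound, 6))
```

This is why the notebook sums ret_log within each month before collapsing: summing is an egen one-liner, whereas compounding simple returns directly would need a running product.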
Task 3.6: Save Monthly Returns Data¶
Save the monthly returns data to $processed/aex_monthly.dta. This will be merged with the crime data later.
file
/Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assi
> gnments/06-assignment/data/processed/aex_monthly.dta saved
✅ Task 3.6 tests passed
Section 4: Reshape Crime Data (Long to Wide)¶
The crime data is currently in long format: one row per municipality × month × crime_type combination.
We need wide format: one row per municipality × month, with separate columns for each crime type.
Why? For regression analysis, we need a single outcome variable per observation.
The reshape command requires:
- A unique i() identifier — here: municipality_code date
- A j() variable whose values become column suffixes — here: crime_id (numeric)
- Variables to reshape — here: reported_crimes
Common pitfalls:
- Duplicates in i() j() combinations → check with: duplicates report municipality_code date crime_id
- A string j() variable → reshape needs it numeric (that's why we create crime_id)
Task 4.1: Examine Current Structure¶
Reload $raw/crime.dta and use tab crime_type to see what crime types are included.
Contains data from /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/06-assignment/data/raw/crime.dta
Observations: 172,344
Variables: 5 06 Jan 2026 22:55
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
crime_type str21 %-9s Type of Crime
municipality~de str6 %-9s Municipality Code
municipality~me str29 %-9s Municipality Name
reported_crimes long %12.0g
date double %td Date (Year-Month)
-------------------------------------------------------------------------------
Sorted by:
Type of Crime | Freq. Percent Cum.
----------------------+-----------------------------------
1.4.2 Moord, doodslag | 57,448 33.33 33.33
1.4.5 Mishandeling | 57,448 33.33 66.67
Totaal misdrijven | 57,448 33.33 100.00
----------------------+-----------------------------------
Total | 172,344 100.00
+--------------------------------------------------------+
| municipa~me date crime_type report~s |
|--------------------------------------------------------|
1. | Aa en Hunze 01jan2012 Totaal misdrijven 84 |
2. | Aa en Hunze 01feb2012 Totaal misdrijven 103 |
3. | Aa en Hunze 01mar2012 Totaal misdrijven 101 |
4. | Aa en Hunze 01apr2012 Totaal misdrijven 71 |
5. | Aa en Hunze 01may2012 Totaal misdrijven 78 |
|--------------------------------------------------------|
6. | Aa en Hunze 01jun2012 Totaal misdrijven 50 |
7. | Aa en Hunze 01jul2012 Totaal misdrijven 69 |
8. | Aa en Hunze 01aug2012 Totaal misdrijven 99 |
9. | Aa en Hunze 01sep2012 Totaal misdrijven 73 |
10. | Aa en Hunze 01oct2012 Totaal misdrijven 77 |
|--------------------------------------------------------|
11. | Aa en Hunze 01nov2012 Totaal misdrijven 67 |
12. | Aa en Hunze 01dec2012 Totaal misdrijven 53 |
13. | Aa en Hunze 01jan2013 Totaal misdrijven 76 |
14. | Aa en Hunze 01feb2013 Totaal misdrijven 65 |
15. | Aa en Hunze 01mar2013 Totaal misdrijven 56 |
+--------------------------------------------------------+
---- CHECKPOINT: Crime data structure examined ----
Total observations (municipality × month × crime_type): 172344
Crime types: 3
✅ Task 4.1 tests passed
Task 4.2: Create Numeric Crime Type Identifier¶
Because reshape requires a numeric j() variable, create crime_id with these codes:
- 1 = "Totaal misdrijven" (Total crimes)
- 2 = "1.4.5 Mishandeling" (Assault)
- 3 = "1.4.2 Moord, doodslag" (Murder/manslaughter)
Name the new variable crime_id.
(172,344 missing values generated)
(57,448 real changes made)
(57,448 real changes made)
(57,448 real changes made)
| Type of Crime
crime_id | 1.4.2 M.. 1.4.5 M.. Totaal .. | Total
-----------+---------------------------------+----------
total | 0 0 57,448 | 57,448
assault | 0 57,448 0 | 57,448
murder | 57,448 0 0 | 57,448
-----------+---------------------------------+----------
Total | 57,448 57,448 57,448 | 172,344
✅ Task 4.2 tests passed
Task 4.3: Reshape Wide¶
- keep municipality_code municipality_name date crime_id reported_crimes — keep only needed variables
- reshape wide reported_crimes, i(municipality_code date) j(crime_id)
- rename reported_crimes1 total_crimes — rename to meaningful names
Duplicates in terms of municipality_code date crime_id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 172344 0
--------------------------------------
(j = 1 2 3)
Data Long -> Wide
-----------------------------------------------------------------------------
Number of observations 172,344 -> 57,448
Number of variables 5 -> 6
j variable (3 values) crime_id -> (dropped)
xij variables:
reported_crimes -> reported_crimes1 reported_crimes
> 2 reported_crimes3
-----------------------------------------------------------------------------
Contains data
Observations: 57,448
Variables: 6
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
municipality~de str6 %-9s Municipality Code
date double %td Date (Year-Month)
total_crimes long %12.0g Total crimes reported
assault_crimes long %12.0g Assault crimes reported
murder_crimes long %12.0g Murder/manslaughter crimes
reported
municipality~me str29 %-9s Municipality Name
-------------------------------------------------------------------------------
Sorted by: municipality_code date
Note: Dataset has changed since last saved.
+-------------------------------------------------------+
1. | munic~de | date | total_~s | assaul~s | murder~s |
| GM0014 | 01jan2012 | 1562 | 111 | 5 |
|-------------------------------------------------------|
| municipality_name |
| Groningen (gemeente) |
+-------------------------------------------------------+
+-------------------------------------------------------+
2. | munic~de | date | total_~s | assaul~s | murder~s |
| GM0014 | 01feb2012 | 1373 | 95 | 4 |
|-------------------------------------------------------|
| municipality_name |
| Groningen (gemeente) |
+-------------------------------------------------------+
+-------------------------------------------------------+
3. | munic~de | date | total_~s | assaul~s | murder~s |
| GM0014 | 01mar2012 | 1663 | 105 | 8 |
|-------------------------------------------------------|
| municipality_name |
| Groningen (gemeente) |
+-------------------------------------------------------+
+-------------------------------------------------------+
4. | munic~de | date | total_~s | assaul~s | murder~s |
| GM0014 | 01apr2012 | 1520 | 91 | 3 |
|-------------------------------------------------------|
| municipality_name |
| Groningen (gemeente) |
+-------------------------------------------------------+
+-------------------------------------------------------+
5. | munic~de | date | total_~s | assaul~s | murder~s |
| GM0014 | 01may2012 | 1728 | 90 | 3 |
|-------------------------------------------------------|
| municipality_name |
| Groningen (gemeente) |
+-------------------------------------------------------+
+-------------------------------------------------------+
6. | munic~de | date | total_~s | assaul~s | murder~s |
| GM0014 | 01jun2012 | 1547 | 98 | 8 |
|-------------------------------------------------------|
| municipality_name |
| Groningen (gemeente) |
+-------------------------------------------------------+
+-------------------------------------------------------+
7. | munic~de | date | total_~s | assaul~s | murder~s |
| GM0014 | 01jul2012 | 1598 | 86 | 4 |
|-------------------------------------------------------|
| municipality_name |
| Groningen (gemeente) |
+-------------------------------------------------------+
+-------------------------------------------------------+
8. | munic~de | date | total_~s | assaul~s | murder~s |
| GM0014 | 01aug2012 | 1450 | 83 | 5 |
|-------------------------------------------------------|
| municipality_name |
| Groningen (gemeente) |
+-------------------------------------------------------+
+-------------------------------------------------------+
9. | munic~de | date | total_~s | assaul~s | murder~s |
| GM0014 | 01sep2012 | 1742 | 100 | 6 |
|-------------------------------------------------------|
| municipality_name |
| Groningen (gemeente) |
+-------------------------------------------------------+
+-------------------------------------------------------+
10. | munic~de | date | total_~s | assaul~s | murder~s |
| GM0014 | 01oct2012 | 1703 | 108 | . |
|-------------------------------------------------------|
| municipality_name |
| Groningen (gemeente) |
+-------------------------------------------------------+
---- CHECKPOINT: Data reshaped wide ----
Observations (municipality × month): 57448
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
total_crimes | 57,243 212.5807 539.8091 1 9239
assault_cr~s | 51,940 11.79105 28.66676 1 463
murder_cri~s | 17,141 2.372032 3.827666 1 48
✅ Task 4.3 tests passed
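Outside Stata, the same long-to-wide operation is a pivot. A minimal pandas sketch with invented rows (the municipality code is real, the counts echo the Groningen listing above):

```python
# Illustrative pandas sketch of Task 4.3: pivot plays the role of reshape wide,
# with i() becoming the index and j() becoming the columns.
import pandas as pd

long_df = pd.DataFrame({
    "municipality_code": ["GM0014"] * 4,
    "date": ["2012-01", "2012-01", "2012-02", "2012-02"],
    "crime_id": ["total", "assault", "total", "assault"],
    "reported_crimes": [1562, 111, 1373, 95],
})

# pivot raises an error if an i() x j() combination is duplicated --
# the same problem duplicates report flags in Stata before a reshape
wide_df = long_df.pivot(index=["municipality_code", "date"],
                        columns="crime_id",
                        values="reported_crimes").reset_index()

print(wide_df.shape)  # one row per municipality-month, one column per crime type
```

As in the Stata output, the row count falls by the number of crime types (172,344 → 57,448 is exactly a factor of three) because each row now carries all three counts.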
Task 4.4: Handle Missing Values¶
Replace missing values in the three crime-count variables with 0.
Replacing 205 missing values in total_crimes with 0
(205 real changes made)
Replacing 5508 missing values in assault_crimes with 0
(5,508 real changes made)
Replacing 40307 missing values in murder_crimes with 0
(40,307 real changes made)
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
total_crimes | 57,448 211.8221 538.9942 0 9239
assault_cr~s | 57,448 10.66055 27.47804 0 463
murder_cri~s | 57,448 .7077531 2.35568 0 48
Remaining missing values: 0
✅ Task 4.4 tests passed
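The replacement step has a direct pandas analogue, sketched here with a toy frame (column names mirror the notebook; the counts are invented):

```python
# Illustrative pandas sketch of Task 4.4: count the missing cells, then
# replace them with 0, as the Stata output above does per variable.
import numpy as np
import pandas as pd

crimes = pd.DataFrame({"total_crimes": [84, np.nan, 71],
                       "murder_crimes": [np.nan, 1, np.nan]})

n_missing = int(crimes.isna().sum().sum())  # count before replacing
crimes = crimes.fillna(0)                   # the replace ... if missing step

print(n_missing, int(crimes.isna().sum().sum()))
```

Note the substantive assumption being made in the task itself: a missing cell is treated as "no crimes reported", which is plausible for rare crimes (murder in small municipalities) but should be stated when interpreting results.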
Task 4.5: Save Wide Crime Data¶
Save the wide-format crime data to $processed/crime_wide.dta.
file
/Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assi
> gnments/06-assignment/data/processed/crime_wide.dta saved
✅ Wide crime data saved to: /Users/casparm4/Github/rsm-data-analytics-in-financ
> e-private/private/assignments/06-assignment/data/processed/crime_wide.dta
Observations: 57448
✅ Task 4.5 tests passed
Section 5: Merge Datasets¶
Now we merge our three datasets:
- Crime data (wide format): municipality × month panel
- Income data: municipality-level cross-section
- AEX returns: monthly time series
The goal is to create a municipality-month panel with all variables.
Task 5.1: Start with Crime Panel¶
Load $processed/crime_wide.dta to use as the base dataset for merging. This will be our "master" dataset since it defines the municipality-month panel structure.
Observations: 57448 Variables: 6 Time period: 01jan2012 to 01nov2025
✅ Task 5.1 tests passed
Task 5.2: Merge Income Data¶
- Crime data has many observations per municipality (one per month)
- Income data has one observation per municipality (cross-section)
- merge m:1 municipality_code using "$raw/income_data.dta"
- tab _merge — check merge results
- keep if _merge == 3 — keep only matched observations
- drop _merge — remove the merge indicator
Result Number of obs
-----------------------------------------
Not matched 1,187
from master 1,169 (_merge==1)
from using 18 (_merge==2)
Matched 56,279 (_merge==3)
-----------------------------------------
Matching result from |
merge | Freq. Percent Cum.
------------------------+-----------------------------------
Master only (1) | 1,169 2.03 2.03
Using only (2) | 18 0.03 2.07
Matched (3) | 56,279 97.93 100.00
------------------------+-----------------------------------
Total | 57,466 100.00
(1,187 observations deleted)
Observations after merge: 56279
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
avg_median~e | 56,279 23.50222 1.863046 19.83227 36.45822
✅ Task 5.2 tests passed
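The m:1 merge maps cleanly onto pandas. This sketch uses one real municipality code plus invented ones (GM9999 is deliberately fake) to reproduce all three _merge outcomes:

```python
# Illustrative pandas sketch of Task 5.2's m:1 merge. indicator=True creates
# a column playing the role of Stata's _merge; keeping rows marked "both"
# mirrors keep if _merge == 3.
import pandas as pd

crime = pd.DataFrame({"municipality_code": ["GM0014", "GM0014", "GM9999"],
                      "total_crimes": [1562, 1373, 10]})
income = pd.DataFrame({"municipality_code": ["GM0014", "GM0003"],
                       "avg_median_std_income": [19.86, 20.61]})

merged = crime.merge(income, on="municipality_code", how="outer", indicator=True)
print(merged["_merge"].value_counts().to_dict())  # left_only / right_only / both

matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
```

Always tabulate the indicator before dropping rows: in the notebook's output, 2.03% master-only and 0.03% using-only observations signal municipality-code mismatches (e.g. mergers over time), worth a sentence in any write-up.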
Task 5.3: Create Year-Month Variable for Merging¶
Create a year_month variable from the daily date (gen year_month = mofd(date)) so the panel can be merged with the monthly AEX returns.
---- CHECKPOINT: Year-month variable created ----
Unique months: 167
+---------------------------------+
| munic~de date year_m~h |
|---------------------------------|
1. | GM0014 01jan2012 2012m1 |
2. | GM0014 01feb2012 2012m2 |
3. | GM0014 01mar2012 2012m3 |
4. | GM0014 01apr2012 2012m4 |
5. | GM0014 01may2012 2012m5 |
+---------------------------------+
✅ Task 5.3 tests passed
Task 5.4: Merge AEX Returns¶
- Crime data has many observations per month (one per municipality)
- Returns data has one observation per month (time series)
Merge the monthly returns from $processed/aex_monthly.dta using merge m:1 year_month. Keep only matched observations.
Result Number of obs
-----------------------------------------
Not matched 19,895
from master 19,883 (_merge==1)
from using 12 (_merge==2)
Matched 36,396 (_merge==3)
-----------------------------------------
Matching result from |
merge | Freq. Percent Cum.
------------------------+-----------------------------------
Master only (1) | 19,883 35.32 35.32
Using only (2) | 12 0.02 35.34
Matched (3) | 36,396 64.66 100.00
------------------------+-----------------------------------
Total | 56,291 100.00
(19,895 observations deleted)
---- CHECKPOINT: AEX returns merged ----
Observations after merge: 36396
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
ret_compound | 36,396 .0072072 .0393381 -.1037125 .1351352
ret_std_mo~y | 36,396 .6468432 4.149869 -12.38631 8.55268
✅ Task 5.4 tests passed
Task 5.5: Verify Panel Structure and Save¶
Declare the panel structure with xtset and save the merged analysis panel to $processed/analysis_panel.dta. Use xtdescribe to verify the panel is balanced and check the time coverage.
- encode municipality_code, gen(muni_id) — create a numeric municipality ID
- xtset muni_id year_month — declare the panel structure
- xtdescribe — verify panel balance
- save "$processed/analysis_panel.dta", replace — save the merged panel
Panel variable: muni_id (strongly balanced)
Time variable: year_month, 2012m1 to 2020m12
Delta: 1 month
muni_id: 1, 2, ..., 337 n = 337
year_month: 2012m1, 2012m2, ..., 2020m12 T = 108
Delta(year_month) = 1 month
Span(year_month) = 108 periods
(muni_id*year_month uniquely identifies each observation)
Distribution of T_i: min 5% 25% 50% 75% 95% max
108 108 108 108 108 108 108
Freq. Percent Cum. | Pattern
---------------------------+--------------------------------------------------
> ------------------------------------------------------------
337 100.00 100.00 | 111111111111111111111111111111111111111111111111
> 111111111111111111111111111111111111111111111111111111111111
---------------------------+--------------------------------------------------
> ------------------------------------------------------------
337 100.00 | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
file
/Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assi
> gnments/06-assignment/data/processed/analysis_panel.dta saved
✅ Merged analysis panel saved to: /Users/casparm4/Github/rsm-data-analytics-in-
> finance-private/private/assignments/06-assignment/data/processed/analysis_pan
> el.dta
Municipalities: 337
Time periods: .
Total observations: 36396
✅ Task 5.5 tests passed
Section 6: Create Analysis Variables¶
Following Huck (2024), we test whether stock returns have differential effects on crime across income groups.
To do this, we need to:
- Create income terciles (high, medium, low income municipalities)
- Generate interaction terms (returns × income terciles)
- Create crime rate variables (per capita)
- Verify the variables are ready for regression
This section prepares all variables needed for the panel regressions in Part B.
Task 6.1: Load Analysis Panel¶
Load $processed/analysis_panel.dta and verify the panel structure is intact.
Panel variable: muni_id (strongly balanced)
Time variable: year_month, 2012m1 to 2020m12
Delta: 1 month
✅ Task 6.1 tests passed
Task 6.2: Create Income Terciles¶
- Tercile 1: Low-income municipalities (bottom 33%)
- Tercile 2: Medium-income municipalities (middle 33%)
- Tercile 3: High-income municipalities (top 33%)
Use xtile to create the tercile variable:
- xtile income_tercile = avg_median_std_income, nq(3) — create terciles
- tab income_tercile — verify the distribution
- bysort income_tercile: summarize avg_median_std_income — check income ranges
Income |
tercile |
(1=low, |
2=mid, |
3=high) | Freq. Percent Cum.
--------------+-----------------------------------
Low income | 12,204 33.53 33.53
Medium income | 12,096 33.23 66.77
High income | 12,096 33.23 100.00
--------------+-----------------------------------
Total | 36,396 100.00
-------------------------------------------------------------------------------
-> income_tercile = Low income
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
avg_median~e | 12,204 21.7271 .7816501 19.83227 22.80175
-------------------------------------------------------------------------------
-> income_tercile = Medium income
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
avg_median~e | 12,096 23.34742 .2878872 22.82476 23.89005
-------------------------------------------------------------------------------
-> income_tercile = High income
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
avg_median~e | 12,096 25.448 1.659939 23.91478 36.45822
---- CHECKPOINT: Income terciles created ----
Tercile groups: 3
✅ Task 6.2 tests passed
Task 6.3: Create Tercile Dummy Variables¶
- low_income: 1 if tercile 1, 0 otherwise
- mid_income: 1 if tercile 2, 0 otherwise
- high_income: 1 if tercile 3, 0 otherwise
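A minimal sketch for the dummies (it assumes income_tercile has no missing values, which the strongly balanced panel suggests; with missings, add an explicit if !missing() condition):

```stata
* Indicator variables for each income tercile
gen low_income  = (income_tercile == 1)
gen mid_income  = (income_tercile == 2)
gen high_income = (income_tercile == 3)

label variable low_income  "Low income tercile (1=yes, 0=no)"
label variable mid_income  "Medium income tercile (1=yes, 0=no)"
label variable high_income "High income tercile (1=yes, 0=no)"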
✅ Task 6.3 tests passed
Task 6.4: Create Interaction Terms¶
- ret_x_low = ret_std_monthly × low_income
- ret_x_mid = ret_std_monthly × mid_income
- ret_x_high = ret_std_monthly × high_income
By including interaction terms, we estimate separate effects:
We expect β₃ < 0 (returns ↓ crime in high-income areas) and β₁ > 0 (returns ↑ crime in low-income areas).
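The interactions listed above can be generated directly; one possible sketch (the summarize line reproduces the check whose output follows):

```stata
* Interaction terms: standardized return × income-group dummies
gen ret_x_low  = ret_std_monthly * low_income
gen ret_x_mid  = ret_std_monthly * mid_income
gen ret_x_high = ret_std_monthly * high_income

label variable ret_x_low  "Standardized return × Low income"
label variable ret_x_mid  "Standardized return × Medium income"
label variable ret_x_high "Standardized return × High income"

summarize ret_x_low ret_x_mid ret_x_high
```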
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
ret_x_low | 36,396 .216894 2.422353 -12.38631 8.55268
ret_x_mid | 36,396 .2149746 2.411696 -12.38631 8.55268
ret_x_high | 36,396 .2149746 2.411696 -12.38631 8.55268
---- CHECKPOINT: Interaction terms created ----
Interactions ready for regression analysis
✅ Task 6.4 tests passed
Task 6.5: Save Final Analysis Dataset¶
Save the final dataset with save "$processed/analysis_panel.dta", replace. Use describe and summarize to document what's in the final dataset.
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
municipality~de str6 %-9s Municipality Code
municipality~me str29 %-9s Municipality Name
date double %td Date (Year-Month)
year_month float %tm Year-month
total_crimes long %12.0g Total crimes reported
assault_crimes long %12.0g Assault crimes reported
murder_crimes long %12.0g Murder/manslaughter crimes
reported
avg_median_st~e double %10.0g Average Median Standardized
Income
income_tercile byte %13.0g tercile_lbl
Income tercile (1=low, 2=mid,
3=high)
low_income float %9.0g Low income tercile (1=yes, 0=no)
mid_income float %9.0g Medium income tercile (1=yes,
0=no)
high_income float %9.0g High income tercile (1=yes, 0=no)
ret_compound float %9.0g Monthly compound return
ret_std_monthly float %9.0g Monthly standardized return (sum
of daily std ret)
ret_x_low float %9.0g Standardized return × Low income
ret_x_mid float %9.0g Standardized return × Medium
income
ret_x_high float %9.0g Standardized return × High income
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
total_crimes | 36,396 223.713 568.559 0 9239
assault_cr~s | 36,396 11.27756 28.59991 0 463
murder_cri~s | 36,396 .7485163 2.4274 0 46
avg_median~e | 36,396 23.50222 1.863055 19.83227 36.45822
ret_std_mo~y | 36,396 .6468432 4.149869 -12.38631 8.55268
-------------+---------------------------------------------------------
ret_x_low | 36,396 .216894 2.422353 -12.38631 8.55268
ret_x_mid | 36,396 .2149746 2.411696 -12.38631 8.55268
ret_x_high | 36,396 .2149746 2.411696 -12.38631 8.55268
file
/Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assi
> gnments/06-assignment/data/processed/analysis_panel.dta saved
Observations: 36396
Variables: 20
File saved: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/privat
> e/assignments/06-assignment/data/processed/analysis_panel.dta
✅ Part A: Data Preparation COMPLETE
Dataset ready for regression analysis in Part B
✅ Task 6.5 tests passed
Part A Summary¶
Congratulations! You have completed Part A: Data Preparation.
What you accomplished:
- ✅ Section 1: Understood the research question from Huck (2024)
- ✅ Section 2: Loaded three raw datasets (crime, income, AEX)
- ✅ Section 3: Calculated monthly standardized returns following published methodology
- ✅ Section 4: Reshaped crime data from long to wide format
- ✅ Section 5: Merged datasets into municipality-month panel
- ✅ Section 6: Created analysis variables (terciles, interactions)
Final dataset: $processed/analysis_panel.dta
Panel structure:
- Unit: Municipality (muni_id)
- Time: Year-month (year_month)
- Observations: 337 municipalities × 108 months = 36,396
Key variables for Part B:
- Outcome: total_crimes, assault_crimes, murder_crimes
- Treatment: ret_std_monthly (standardized monthly return)
- Moderator: income_tercile, low_income, mid_income, high_income
- Interactions: ret_x_low, ret_x_mid, ret_x_high
Next steps (Part B):
- Descriptive statistics and visualizations
- Panel regression with fixed effects
- Testing differential effects across income groups
- Robustness checks
Your thesis replication will follow this exact workflow:
- Find published paper with interesting result
- Obtain raw data (harder than it looks!)
- Clean and merge datasets (60-80% of your time)
- Replicate baseline result
- Test extension or alternative mechanism
- Discuss limitations honestly
Part B: Analysis¶
Now that we have prepared the data, we can analyze the relationship between stock returns and crime across income groups.
Section 7: Descriptive Statistics¶
Before running regressions, we examine the data through summary statistics and understand the distribution of our key variables.
Following standard empirical practice, we create:
- Summary statistics by income tercile (like Huck's Table 2)
- Return distribution statistics
- A formatted table for presentation
Task 7.1: Summary Statistics by Income Tercile¶
- use "$processed/analysis_panel.dta", clear — load the final panel
- tabstat total_crimes assault_crimes murder_crimes, by(income_tercile) stat(n mean sd min max) — this shows N, mean, SD, min, and max for each income group
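Put together, the hints above amount to a short block like this (a sketch; the /// line continuation is just for readability):

```stata
* Load the final panel and summarize crime outcomes by income tercile
use "$processed/analysis_panel.dta", clear
tabstat total_crimes assault_crimes murder_crimes, ///
    by(income_tercile) stat(n mean sd min max)
```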
Panel variable: muni_id (strongly balanced)
Time variable: year_month, 2012m1 to 2020m12
Delta: 1 month
Summary for variables: total_crimes assault_crimes murder_crimes
Group variable: income_tercile (Income tercile (1=low, 2=mid, 3=high))
income_tercile   |        N      Mean       SD      Min      Max
-----------------+--------------------------------------------------
Low income       |
  total_crimes   | 12204.00   451.41   927.86    12.00  9239.00
  assault_crimes | 12204.00    23.77    46.21     0.00   463.00
  murder_crimes  | 12204.00     1.63     3.93     0.00    46.00
-----------------+--------------------------------------------------
Medium income    |
  total_crimes   | 12096.00   131.25   132.65     0.00  1159.00
  assault_crimes | 12096.00     6.37     7.19     0.00    58.00
  murder_crimes  | 12096.00     0.39     0.78     0.00     8.00
-----------------+--------------------------------------------------
High income      |
  total_crimes   | 12096.00    86.45    82.52     1.00   964.00
  assault_crimes | 12096.00     3.58     3.83     0.00    39.00
  murder_crimes  | 12096.00     0.22     0.55     0.00     8.00
-----------------+--------------------------------------------------
Total            |
  total_crimes   | 36396.00   223.71   568.56     0.00  9239.00
  assault_crimes | 36396.00    11.28    28.60     0.00   463.00
  murder_crimes  | 36396.00     0.75     2.43     0.00    46.00
------------------------------------------------------------------
✅ Task 7.1 tests passed
Task 7.2: Summary Statistics for Returns¶
Use summarize, detail to see the full distribution including percentiles, skewness, and kurtosis. This helps identify outliers and understand the variation in market returns over the sample period.
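A sketch of the two detailed summaries whose output follows:

```stata
* Full distributions (percentiles, skewness, kurtosis) for both return measures
di as txt "Compound Monthly Returns:"
summarize ret_compound, detail

di as txt "Standardized Monthly Returns:"
summarize ret_std_monthly, detail
```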
Compound Monthly Returns:
Monthly compound return
-------------------------------------------------------------
Percentiles Smallest
1% -.101367 -.1037125
5% -.0590024 -.1037125
10% -.0470165 -.1037125 Obs 36,396
25% -.0185647 -.1037125 Sum of wgt. 36,396
50% .0115142 Mean .0072072
Largest Std. dev. .0393381
75% .0340431 .1351352
90% .0509977 .1351352 Variance .0015475
95% .0623475 .1351352 Skewness -.1718706
99% .0973073 .1351352 Kurtosis 3.780191
Standardized Monthly Returns:
Monthly standardized return (sum of daily std ret)
-------------------------------------------------------------
Percentiles Smallest
1% -9.791107 -12.38631
5% -7.349111 -12.38631
10% -4.55779 -12.38631 Obs 36,396
25% -1.753273 -12.38631 Sum of wgt. 36,396
50% 1.219513 Mean .6468432
Largest Std. dev. 4.149869
75% 3.87236 8.55268
90% 5.401238 8.55268 Variance 17.22141
95% 6.930266 8.55268 Skewness -.625603
99% 7.930677 8.55268 Kurtosis 3.264927
✅ Task 7.2 tests passed
Task 7.3: Create Formatted Summary Table¶
- Crime variables by income tercile
- Income and return variables for the full sample
- Mean, SD, Min, Max, and N for each variable
Export the table to $tables/summary_stats.tex for use in reports.
- estpost tabstat — prepare statistics for export
- esttab using "filename.tex" — export to LaTeX
- Use cells("mean(fmt(2)) sd(fmt(2)) min max count") to specify statistics
- Add the label option to use variable labels instead of names
- Export as two separate files:
  - summary_stats_panelA.tex — crime statistics by income tercile (use by(income_tercile))
  - summary_stats_panelB.tex — full sample statistics
(Note: the code below is run with echo to enable preserve/restore functionality.)
. di as txt ""
. di as txt "Creating Panel A: Crime Statistics by Income Tercile..."
Creating Panel A: Crime Statistics by Income Tercile...
. preserve
. estpost tabstat total_crimes assault_crimes murder_crimes, by(income_tercile)
> statistics(mean sd min max count) columns(statistics)
Summary statistics: mean sd min max count
for variables: total_crimes assault_crimes murder_crimes
by categories of: income_tercile
income_terci | e(mean) e(sd) e(min) e(max) e(count)
-------------+-------------------------------------------------------
Low income |
total_crimes | 451.4092 927.8566 12 9239 12204
assault_cr~s | 23.7718 46.20558 0 463 12204
murder_cri~s | 1.633972 3.933833 0 46 12204
-------------+-------------------------------------------------------
Medium inc~e |
total_crimes | 131.2493 132.6499 0 1159 12096
assault_cr~s | 6.371197 7.192719 0 58 12096
murder_cri~s | .3865741 .7776529 0 8 12096
-------------+-------------------------------------------------------
High income |
total_crimes | 86.4475 82.52387 1 964 12096
assault_cr~s | 3.578125 3.831013 0 39 12096
murder_cri~s | .2170966 .5548765 0 8 12096
-------------+-------------------------------------------------------
Total |
total_crimes | 223.713 568.559 0 9239 36396
assault_cr~s | 11.27756 28.59991 0 463 36396
murder_cri~s | .7485163 2.4274 0 46 36396
. esttab using "$tables/summary_stats_panelA.tex", replace cells("mean(fmt(2))
> sd(fmt(2)) min(fmt(0)) max(fmt(0)) count(fmt(0))") noobs nonumber nomtitle ti
> tle("Panel A: Crime Statistics by Income Tercile")
(output written to /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/06-assignment/output/tables/summary_stats_panelA.tex)
. restore
. di as txt "Creating Panel B: Full Sample Statistics..."
Creating Panel B: Full Sample Statistics...
. estpost summarize total_crimes assault_crimes murder_crimes avg_median_std_in
> come ret_compound ret_std_monthly
| e(count) e(sum_w) e(mean) e(Var) e(sd)
-------------+-------------------------------------------------------
total_crimes | 36396 36396 223.713 323259.3 568.559
assault_cr~s | 36396 36396 11.27756 817.9551 28.59991
murder_cri~s | 36396 36396 .7485163 5.89227 2.4274
avg_median~e | 36396 36396 23.50222 3.470975 1.863055
ret_compound | 36396 36396 .0072072 .0015475 .0393381
ret_std_mo~y | 36396 36396 .6468432 17.22141 4.149869
| e(min) e(max) e(sum)
-------------+---------------------------------
total_crimes | 0 9239 8142258
assault_cr~s | 0 463 410458
murder_cri~s | 0 46 27243
avg_median~e | 19.83227 36.45822 855386.9
ret_compound | -.1037125 .1351352 262.3129
ret_std_mo~y | -12.38631 8.55268 23542.5
. esttab using "$tables/summary_stats_panelB.tex", replace cells("mean(fmt(2))
> sd(fmt(2)) min(fmt(2)) max(fmt(2)) count(fmt(0))") noobs nonumber nomtitle ti
> tle("Panel B: Full Sample Statistics") label
(output written to /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/06-assignment/output/tables/summary_stats_panelB.tex)
. di as txt "---- CHECKPOINT: Summary tables exported ----"
---- CHECKPOINT: Summary tables exported ----
. di as result "✅ Panel A saved: $tables/summary_stats_panelA.tex"
✅ Panel A saved: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/p
> rivate/assignments/06-assignment/output/tables/summary_stats_panelA.tex
. di as result "✅ Panel B saved: $tables/summary_stats_panelB.tex"
✅ Panel B saved: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/p
> rivate/assignments/06-assignment/output/tables/summary_stats_panelB.tex
.
✅ Task 7.3 tests passed
When examining summary statistics, pay attention to:
Crime levels across terciles:
- Do high-income areas have more or less crime than low-income areas?
- This affects interpretation—if high-income has zero crime, returns cannot reduce it further
Return distribution:
- Mean compound return tells us average market performance (should be slightly positive)
- SD shows volatility—higher SD means more variation to identify effects
- Daily standardized returns have mean ≈ 0 and SD ≈ 1 by construction; the monthly measure sums roughly 21 trading days, so its SD is larger (about 4.1 here)
Sample balance:
- Each tercile should have roughly N/3 observations
- Large imbalances suggest issues with tercile creation
Section 8: Visualize Crime and Returns Over Time¶
Visual inspection is an essential first step before formal statistical tests. We create three plots:
- National crime trends — How has total crime evolved over time?
- Crime and returns overlay — Do they move together visually?
- Crime by income tercile — Do trends differ across income groups?
These plots provide intuition before we estimate regressions.
Task 8.1: National Crime Time Series¶
Create a line plot of national monthly crime totals and save it as crime_timeseries.png.
- preserve — save current dataset state
- collapse (sum) total_crimes, by(year_month) — aggregate to national level
- twoway line total_crimes year_month — create line plot
- graph export "$figures/crime_timeseries.png", replace width(1200) — save figure
- restore — return to original dataset
(Note: the code below is run with echo to enable preserve/restore functionality.)
. preserve
. collapse (sum) total_crimes assault_crimes murder_crimes, by(year_month)
. twoway (line total_crimes year_month, lcolor(navy) lwidth(medium)), title("To
> tal Crime in the Netherlands", size(medium)) subtitle("Monthly aggregates acr
> oss all municipalities", size(small)) xtitle("Year-Month") ytitle("Total Crim
> es Reported") xlabel(, angle(45) labsize(small)) ylabel(, labsize(small) form
> at(%12.0fc)) graphregion(color(white)) plotregion(color(white)) note("Source:
> CBS/Politie crime data, 2012-2019", size(vsmall))
. graph export "$figures/crime_timeseries.png", replace width(1200)
file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assig
> nments/06-assignment/output/figures/crime_timeseries.png written in PNG forma
> t
. di as txt "---- CHECKPOINT: National crime time series ----"
---- CHECKPOINT: National crime time series ----
. di as result "✅ Figure saved: $figures/crime_timeseries.png"
✅ Figure saved: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/pr
> ivate/assignments/06-assignment/output/figures/crime_timeseries.png
. di as result "Months plotted: " _N
Months plotted: 108
. restore
.
✅ Task 8.1 tests passed
Task 8.2: Overlay Crime and Returns¶
For the standardization, create variables named crime_z and ret_z (z-scores for crime and returns respectively). Add a horizontal reference line at zero, and include a legend labelling the two series as "Total Crime (standardized)" and "AEX Returns (standardized)". Save the figure as crime_returns_overlay.png.
- Crime: counts ranging from ~10,000 to ~50,000 per month
- Returns: percentages ranging from -10% to +10%
By standardizing both (z-scores), we can plot them on the same axis and see whether they move together. A standardized value of +1 means "one standard deviation above average" for that series.
If crime and returns move in opposite directions (negative correlation), we should see peaks in one series align with troughs in the other.
- preserve — save current dataset state
- collapse (sum) total_crimes (mean) ret_std_monthly, by(year_month) — aggregate to national level
- egen mean_crime = mean(total_crimes) — calculate mean for z-score
- egen sd_crime = sd(total_crimes) — calculate SD for z-score
- gen crime_z = (total_crimes - mean_crime) / sd_crime — standardize
- Repeat for returns: create mean_ret, sd_ret, and ret_z
- twoway (line crime_z year_month) (line ret_z year_month) — overlay both series
- graph export "$figures/crime_returns_overlay.png", replace width(1200)
- restore — return to original dataset
(Note: the code below is run with echo to enable preserve/restore functionality.)
. preserve
. collapse (sum) total_crimes (mean) ret_std_monthly, by(year_month)
. egen mean_crime = mean(total_crimes)
. egen sd_crime = sd(total_crimes)
. gen crime_z = (total_crimes - mean_crime) / sd_crime
. egen mean_ret = mean(ret_std_monthly)
. egen sd_ret = sd(ret_std_monthly)
. gen ret_z = (ret_std_monthly - mean_ret) / sd_ret
. twoway (line crime_z year_month, lcolor(maroon) lwidth(medium) lpattern(solid
> )) (line ret_z year_month, lcolor(navy) lwidth(medium) lpattern(dash)), title
> ("Crime and Stock Returns Over Time", size(medium)) subtitle("Standardized (z
> -scores) for visual comparison", size(small)) xtitle("Year-Month") ytitle("St
> andard Deviations from Mean") xlabel(, angle(45) labsize(small)) ylabel(, lab
> size(small)) legend(label(1 "Total Crime (standardized)") label(2 "AEX Return
> s (standardized)") rows(1) position(6) size(small)) graphregion(color(white))
> plotregion(color(white)) yline(0, lcolor(gray) lpattern(dot)) note("Both ser
> ies standardized to mean=0, SD=1 for comparability", size(vsmall))
. graph export "$figures/crime_returns_overlay.png", replace width(1200)
file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assig
> nments/06-assignment/output/figures/crime_returns_overlay.png written in PNG
> format
. di as txt "---- CHECKPOINT: Crime and returns overlay ----"
---- CHECKPOINT: Crime and returns overlay ----
. di as result "✅ Figure saved: $figures/crime_returns_overlay.png"
✅ Figure saved: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/pr
> ivate/assignments/06-assignment/output/figures/crime_returns_overlay.png
. restore
.
✅ Task 8.2 tests passed
Task 8.3: Crime Time Series by Income Tercile¶
First collapse the data to the tercile-month level (sum of total_crimes by year_month and income_tercile). Include a legend with labels "Low Income (Tercile 1)", "Medium Income (Tercile 2)", and "High Income (Tercile 3)". Save the figure as crime_by_income.png.
- preserve
- collapse (sum) total_crimes, by(year_month income_tercile)
- Use twoway with three line plots, each filtered by if income_tercile==1, ==2, ==3
- graph export "$figures/crime_by_income.png", replace width(1200)
- restore
(Note: the code below is run with echo to enable preserve/restore functionality.)
. preserve
. collapse (sum) total_crimes, by(year_month income_tercile)
. twoway (line total_crimes year_month if income_tercile==1, lcolor(red) lwidth
> (medium) lpattern(solid)) (line total_crimes year_month if income_tercile==2,
> lcolor(orange) lwidth(medium) lpattern(dash)) (line total_crimes year_month
> if income_tercile==3, lcolor(green) lwidth(medium) lpattern(longdash)), title
> ("Crime Trends by Income Tercile", size(medium)) subtitle("Total crime aggreg
> ated by municipality income group", size(small)) xtitle("Year-Month") ytitle(
> "Total Crimes Reported") xlabel(, angle(45) labsize(small)) ylabel(, labsize(
> small) format(%12.0fc)) legend(label(1 "Low Income (Tercile 1)") label(2 "Med
> ium Income (Tercile 2)") label(3 "High Income (Tercile 3)") rows(1) position(
> 6) size(small)) graphregion(color(white)) plotregion(color(white)) note("Sour
> ce: CBS/Politie crime data by municipality income tercile", size(vsmall))
. graph export "$figures/crime_by_income.png", replace width(1200)
file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assig
> nments/06-assignment/output/figures/crime_by_income.png written in PNG format
. di as txt "---- CHECKPOINT: Crime by income tercile ----"
---- CHECKPOINT: Crime by income tercile ----
. di as result "✅ Figure saved: $figures/crime_by_income.png"
✅ Figure saved: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/pr
> ivate/assignments/06-assignment/output/figures/crime_by_income.png
. restore
.
✅ Task 8.3 tests passed
Figure 1 (National trend): Is crime increasing, decreasing, or stable? Are there seasonal patterns?
Figure 2 (Overlay): Do crime and returns move together (positive correlation) or in opposite directions (negative correlation)? Remember: we expect crime to decrease when returns are high (negative relationship).
Figure 3 (By tercile): Do the three income groups have similar levels and trends? High-income areas typically have lower crime levels. If high-income municipalities have near-zero crime, it may be difficult to detect a relationship with returns (floor effect).
Important: Visual patterns are suggestive but not conclusive. We need regression analysis (next sections) to control for confounders and test statistical significance.
Section 9: Baseline Panel Regression¶
We now estimate our first formal statistical tests. The baseline regressions estimate the average effect of stock returns on crime across all municipalities, without distinguishing by income group.
Specification:
$$\text{Crime}_{it} = \beta_0 + \beta_1 \text{Returns}_t + \alpha_i + \varepsilon_{it}$$
Where:
- $\text{Crime}_{it}$ = Total crimes in municipality $i$ at month $t$
- $\text{Returns}_t$ = Monthly AEX return (compound or standardized)
- $\alpha_i$ = Municipality fixed effects (absorbs time-invariant municipality characteristics)
- $\varepsilon_{it}$ = Error term (clustered at municipality level)
Key question: Is $\beta_1$ positive or negative? Does higher return correlate with more or less crime on average?
Task 9.1: Load Analysis Panel and Verify Structure¶
Load $processed/analysis_panel.dta and verify the panel structure is correctly set. Check that we have the right number of municipalities and time periods before running regressions.
Panel variable: muni_id (strongly balanced)
Time variable: year_month, 2012m1 to 2020m12
Delta: 1 month
muni_id: 1, 2, ..., 337 n = 337
year_month: 2012m1, 2012m2, ..., 2020m12 T = 108
Delta(year_month) = 1 month
Span(year_month) = 108 periods
(muni_id*year_month uniquely identifies each observation)
Distribution of T_i: min 5% 25% 50% 75% 95% max
108 108 108 108 108 108 108
Freq.  Percent   Cum.  | Pattern
-----------------------+------------------------------
 337    100.00  100.00 | 111…111  (observed in all 108 months)
-----------------------+------------------------------
 337    100.00         | XXX…XXX
---- CHECKPOINT: Panel structure verified ----
Panel variable: muni_id
Time variable: year_month
Total observations: 36396
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
total_crimes long %12.0g Total crimes reported
ret_compound float %9.0g Monthly compound return
ret_std_monthly float %9.0g Monthly standardized return (sum
of daily std ret)
income_tercile byte %13.0g tercile_lbl
Income tercile (1=low, 2=mid,
3=high)
✅ Task 9.1 tests passed
Task 9.2: Estimate Baseline with Compound Returns¶
- xtreg depvar indepvar, fe — fixed effects regression
- vce(cluster muni_id) — cluster standard errors by municipality
- estimates store modelname — save estimates for later comparison
- Coefficient interpretation: ret_compound is in decimal form, so β₁ is the change in crimes per one-unit (100 percentage point) move in returns; divide by 100 for a 1 percentage point effect
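One way to run the baseline, sketched from the hints above (the stored name m1_comp is an assumption for illustration; the text only specifies m1_std for the standardized version in Task 9.3):

```stata
* Baseline: municipality fixed effects, SEs clustered by municipality
xtreg total_crimes ret_compound, fe vce(cluster muni_id)
estimates store m1_comp   // assumed name, for the model comparison later
```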
========================================
Baseline Model: Compound Returns
========================================
Fixed-effects (within) regression Number of obs = 36,396
Group variable: muni_id Number of groups = 337
R-squared: Obs per group:
Within = 0.0002 min = 108
Between = . avg = 108.0
Overall = 0.0000 max = 108
F(1, 336) = 36.31
corr(u_i, Xb) = 0.0000 Prob > F = 0.0000
(Std. err. adjusted for 337 clusters in muni_id)
------------------------------------------------------------------------------
| Robust
total_crimes | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
ret_compound | 33.71929 5.595892 6.03 0.000 22.71189 44.72668
_cons | 223.47 .0403306 5540.95 0.000 223.3906 223.5493
-------------+----------------------------------------------------------------
sigma_u | 562.89334
sigma_e | 86.074038
rho | .97715169 (fraction of variance due to u_i)
------------------------------------------------------------------------------
---- CHECKPOINT: Baseline regression (compound) ----
Coefficient on ret_compound: 33.719288
Standard error: 5.5958915
t-statistic: 6.0257223
Interpretation:
→ Positive coefficient: Higher returns associated with MORE crime
✅ Task 9.2 tests passed
Task 9.3: Estimate Baseline with Standardized Returns¶
Re-estimate the baseline using standardized returns (ret_std_monthly) instead of compound returns. Use the same fixed effects and clustered standard errors as before: xtreg total_crimes ret_std_monthly, fe vce(cluster muni_id). This follows Huck (2024)'s methodology more closely. Compare the results: do both return measures give similar findings about the average relationship? Store the estimates as m1_std.
========================================
Baseline Model: Standardized Returns
========================================
Fixed-effects (within) regression Number of obs = 36,396
Group variable: muni_id Number of groups = 337
R-squared: Obs per group:
Within = 0.0010 min = 108
Between = . avg = 108.0
Overall = 0.0000 max = 108
F(1, 336) = 43.37
corr(u_i, Xb) = -0.0000 Prob > F = 0.0000
(Std. err. adjusted for 337 clusters in muni_id)
------------------------------------------------------------------------------
| Robust
total_crimes | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
ret_std_mo~y | .6651736 .1009989 6.59 0.000 .4665037 .8638435
_cons | 223.2827 .0653305 3417.74 0.000 223.1542 223.4112
-------------+----------------------------------------------------------------
sigma_u | 562.89334
sigma_e | 86.039671
rho | .97716951 (fraction of variance due to u_i)
------------------------------------------------------------------------------
---- CHECKPOINT: Baseline regression (standardized) ----
Coefficient on ret_std_monthly: .6651736
Standard error: .10099893
t-statistic: 6.5859472
Interpretation:
→ A 1 SD increase in returns → change of .6651736 crimes
✅ Task 9.3 tests passed
Model 1 (Compound Returns):
Model 2 (Standardized Returns):
What to expect: Both should show the same sign (positive or negative) but different magnitudes. The standardized version is preferred for comparability to Huck (2024).
Important limitation: These baseline models estimate the average effect across all income groups. They do not test Huck's key hypothesis about differential effects. That comes in Section 10.
Section 10: Differential Effects by Income (Core Test)¶
This is the key test of Huck's hypothesis: Do stock returns affect crime differently in high-income vs. low-income areas?
We use standardized returns for the interaction analysis, following Huck (2024). The compound return baseline from Section 9 serves as a naive comparison in the final regression table.
Huck's (2024) finding:
- High-income areas (investors): Returns ↑ → Crime ↓ (negative coefficient)
- Low-income areas (non-investors): Returns ↑ → Crime ↑ (positive coefficient)
Mechanism: Relative wealth/status affects psychological well-being. When the market rises:
- Investors feel wealthier → better mood → less crime
- Non-investors feel relatively poorer → worse mood → more crime
Our specification (with interactions):
$$\text{Crime}_{it} = \beta_0 + \beta_1 \text{Returns}_t + \beta_2 (\text{Returns}_t \times \text{High}_{i}) + \beta_3 (\text{Returns}_t \times \text{Low}_{i}) + \alpha_i + \varepsilon_{it}$$
Where:
- $\beta_1$ = Effect in medium-income areas (reference group)
- $\beta_1 + \beta_2$ = Effect in high-income areas
- $\beta_1 + \beta_3$ = Effect in low-income areas
Key test: Is $\beta_2 < 0$ (high-income negative) and $\beta_3 > 0$ (low-income positive)?
Task 10.1: Estimate Interactions with Standardized Returns¶
The interaction terms ret_x_high, ret_x_mid, and ret_x_low were already created in Part A (Section 6). Use municipality fixed effects with clustered standard errors: xtreg total_crimes ret_x_low ret_x_mid ret_x_high, fe vce(cluster muni_id). Store the estimates as m2_std. This specification follows Huck (2024)'s methodology exactly.
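The estimation step spelled out in the instructions, plus one optional extra line (an assumption, not required by the task) testing whether the low- and high-income effects differ, which mirrors the difference test Huck reports:

```stata
* Interaction model: one coefficient per income group, municipality FE
xtreg total_crimes ret_x_low ret_x_mid ret_x_high, fe vce(cluster muni_id)
estimates store m2_std

* Optional: does the low-income effect differ from the high-income effect?
test ret_x_low = ret_x_high
```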
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
ret_x_high float %9.0g Standardized return × High income
ret_x_mid float %9.0g Standardized return × Medium
income
ret_x_low float %9.0g Standardized return × Low income
========================================
Interaction Model: Standardized Returns
========================================
Fixed-effects (within) regression Number of obs = 36,396
Group variable: muni_id Number of groups = 337
R-squared: Obs per group:
Within = 0.0015 min = 108
Between = 0.0835 avg = 108.0
Overall = 0.0010 max = 108
F(3, 336) = 55.37
corr(u_i, Xb) = 0.0256 Prob > F = 0.0000
(Std. err. adjusted for 337 clusters in muni_id)
------------------------------------------------------------------------------
| Robust
total_crimes | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
ret_x_low | 1.30699 .2843909 4.60 0.000 .7475796 1.866401
ret_x_mid | .4399819 .0600953 7.32 0.000 .3217714 .5581924
ret_x_high | .2428179 .0254009 9.56 0.000 .1928532 .2927826
_cons | 223.2827 .0632572 3529.76 0.000 223.1583 223.4072
-------------+----------------------------------------------------------------
sigma_u | 562.80676
sigma_e | 86.02041
rho | .97717264 (fraction of variance due to u_i)
------------------------------------------------------------------------------
---- CHECKPOINT: Interaction model (standardized) ----
Effect by income group (per 1 SD return):
Low-income: 1.3069905
Mid-income: .43998192
High-income: .24281791
✗ Both positive: Returns increase crime in all groups
✅ Task 10.1 tests passed
Task 10.2: Interpret the Differential Effects¶
- Are the coefficients for high-income and low-income areas statistically significant?
- What are the signs of the coefficients? Do they match Huck's findings?
- What is the economic magnitude? (How many crimes per SD of return?)
- Why might the pattern differ from Huck's US findings?
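For the economic-magnitude question, a hedged back-of-envelope sketch: the coefficients come from the Task 10.1 output, the group means from Task 7.1, and "per 1 SD" follows the notebook's reading of one unit of ret_std_monthly.

```stata
* Effect of a one-unit return move relative to mean monthly crime per municipality
di as txt "Low income:  " %4.2f (1.3069905 / 451.41 * 100) "% of mean monthly crime"
di as txt "High income: " %4.2f (0.2428179 /  86.45 * 100) "% of mean monthly crime"
```

Both work out to roughly a quarter of a percent of mean monthly crime, so the effects are statistically significant but economically small.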
For comparison, Huck (2024) reports, using daily data from 2,700 US police agencies (1991-2015):
- High-income areas (investors): A 1 SD return increase → 37 bps decrease in violent crime (−12.5 crimes per 100M population)
- Low-income areas (noninvestors): A 1 SD return increase → 25 bps increase in violent crime (+15.2 crimes per 100M population)
- The difference between high and low income coefficients is statistically significant (t = 3.31)
What you should look for:
Why results might differ:
Section 11: Two-Way Fixed Effects¶
So far we have used one-way fixed effects (municipality FE only). This controls for time-invariant municipality characteristics (e.g., urban vs rural, demographics) but does not control for time shocks that affect all municipalities simultaneously.
Two-way fixed effects (TWFE) adds time fixed effects to absorb common shocks:
$$\text{Crime}_{it} = \beta_2 (\text{Returns}_t \times \text{High}_{i}) + \beta_3 (\text{Returns}_t \times \text{Low}_{i}) + \alpha_i + \gamma_t + \varepsilon_{it}$$
Where:
- $\alpha_i$ = Municipality fixed effects
- $\gamma_t$ = Time (year-month) fixed effects
- Note: The main return effect $\beta_1 \text{Returns}_t$ is absorbed by time FE because all municipalities have the same return in month $t$
What time FE controls for:
- Seasonal crime patterns (more crime in summer?)
- National policy changes (new policing strategies)
- Economic conditions (unemployment, GDP growth)
- Any factor that affects all municipalities equally in a given month
Key advantage: Better identification of differential effects across income groups by removing confounding time trends.
Task 11.1: Estimate TWFE with reghdfe¶
Use `reghdfe`. This package efficiently handles high-dimensional fixed effects (many municipalities × many months). Include only the interaction terms, as the main return effect is absorbed by the time FE.
Syntax: `reghdfe depvar indepvars, absorb(fe1 fe2) vce(cluster var)`
- `absorb(muni_id year_month)` — absorbs both municipality and time FE
- Do NOT include the main return effect (it is collinear with the time FE)
- Only include interactions that vary within time periods (income × returns)
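Putting these hints together, the estimation step might look like the following sketch (the interaction names `ret_x_high` and `ret_x_low` match the output below; the stored-estimate name `m3_twfe` is an assumption):

```stata
* Two-way FE: absorb municipality and year-month fixed effects;
* cluster standard errors at the municipality level.
* The main return effect is omitted because it is collinear with the time FE.
reghdfe total_crimes ret_x_high ret_x_low, ///
    absorb(muni_id year_month) vce(cluster muni_id)
eststo m3_twfe   // store for later comparison (name is an assumption)
```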
========================================
Two-Way Fixed Effects Model
========================================
Note: Main return effect absorbed by time FE
(all municipalities have same return each month)
(MWFE estimator converged in 2 iterations)
HDFE Linear regression Number of obs = 36,396
Absorbing 2 HDFE groups F( 2, 336) = 11.10
Statistics robust to heteroskedasticity Prob > F = 0.0000
R-squared = 0.9805
Adj R-squared = 0.9802
Within R-sq. = 0.0006
Number of clusters (muni_id) = 337 Root MSE = 79.9471
(Std. err. adjusted for 337 clusters in muni_id)
------------------------------------------------------------------------------
| Robust
total_crimes | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
ret_x_high | -.197164 .0653383 -3.02 0.003 -.3256876 -.0686404
ret_x_low | .8670085 .2910952 2.98 0.003 .2944098 1.439607
_cons | 223.5673 .0672407 3324.88 0.000 223.4351 223.6996
------------------------------------------------------------------------------
Absorbed degrees of freedom:
-----------------------------------------------------+
Absorbed FE | Categories - Redundant = Num. Coefs |
-------------+---------------------------------------|
muni_id | 337 337 0 *|
year_month | 108 1 107 |
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation
---- CHECKPOINT: Two-way FE model ----
Effect by income group (relative to mid-income):
High-income: -.19716402
Low-income: .86700855
Fixed effects absorbed:
Municipalities: 337
Time periods: 108
✅ Task 11.1 tests passed
Task 11.2: Compare One-Way vs Two-Way Fixed Effects¶
Use `esttab` to create a clean comparison table showing both specifications.
========================================
Comparison: One-Way FE vs Two-Way FE
========================================
Effect of Standardized Returns on Crime by Income Group
----------------------------------------------------
(1) (2)
One-Way FE Two-Way FE
----------------------------------------------------
Monthly standardiz~m
Standardized retur~e 1.307*** 0.867***
(0.284) (0.291)
Standardized retur~c 0.440***
(0.060)
Standardized retur~m 0.243*** -0.197***
(0.025) (0.065)
----------------------------------------------------
Observations 36,396 36,396
Adjusted R² 0.001 0.980
----------------------------------------------------
Standard errors in parentheses
* p<0.10, ** p<0.05, *** p<0.01
---- CHECKPOINT: Model comparison ----
Key differences:
• One-Way FE: Controls for municipality characteristics only
• Two-Way FE: Controls for municipality + time shocks
Questions to consider:
1. Do coefficients change substantially?
2. Do significance levels change?
3. Which specification is more credible?
✅ Task 11.2 tests passed
- Identification: Effects now identified purely from differential responses across income groups (within-month variation)
- Confounders removed: Any national trend (seasonality, policy changes, economic conditions) is absorbed
- Main effect absorbed: Cannot estimate average return effect (all municipalities have same return each month)
How to interpret coefficient changes:
Which is more credible?
Two-way FE is generally preferred because:
However, TWFE has limitations:
Section 12: Robustness Checks¶
Good empirical work tests whether findings are robust to alternative specifications. If results change dramatically across reasonable alternatives, they may be fragile or driven by specific modeling choices.
We test three alternative specifications:
- Log specification — Test whether effects are proportional (% changes) rather than absolute
- Assault crimes only — Focus on violent crime (closer to Huck's dependent variable)
- Volatility instead of returns — Test whether market uncertainty (not direction) affects crime
What makes a good robustness check?
- Addresses a legitimate alternative hypothesis
- Changes one thing at a time (outcome, specification, or sample)
- Has a clear interpretation if results differ
All robustness checks use the TWFE specification (municipality + time FE) from Section 11.
Task 12.1: Log Specification¶
Estimate the log specification with `reghdfe`, using `absorb(muni_id year_month)` and `vce(cluster muni_id)`. Store the estimates as `m4_log`. Name the log variable `log_total_crimes` (use `ln(total_crimes + 1)` to handle zeros).
- Levels: β measures change in crime counts (e.g., 10 more crimes)
- Logs: β measures change in percentage (e.g., 2% increase)
When logs are preferred:
Technical note: We use ln(crime + 1) to handle zeros (some municipalities have zero assault/murder in some months).
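A minimal sketch of this step, using the variable and model names given above:

```stata
* Log outcome; the +1 handles municipality-months with zero crimes
gen log_total_crimes = ln(total_crimes + 1)

* Same TWFE specification as Section 11, now in logs
reghdfe log_total_crimes ret_x_high ret_x_low, ///
    absorb(muni_id year_month) vce(cluster muni_id)
eststo m4_log
```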
========================================
Robustness Check 1: Log Specification
========================================
(MWFE estimator converged in 2 iterations)
HDFE Linear regression Number of obs = 36,396
Absorbing 2 HDFE groups F( 2, 336) = 0.54
Statistics robust to heteroskedasticity Prob > F = 0.5830
R-squared = 0.9676
Adj R-squared = 0.9672
Within R-sq. = 0.0000
Number of clusters (muni_id) = 337 Root MSE = 0.1909
(Std. err. adjusted for 337 clusters in muni_id)
------------------------------------------------------------------------------
| Robust
log_total_~s | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
ret_x_high | .0001186 .0005546 0.21 0.831 -.0009723 .0012095
ret_x_low | -.0003916 .0004915 -0.80 0.426 -.0013584 .0005752
_cons | 4.680233 .0001938 2.4e+04 0.000 4.679852 4.680615
------------------------------------------------------------------------------
Absorbed degrees of freedom:
-----------------------------------------------------+
Absorbed FE | Categories - Redundant = Num. Coefs |
-------------+---------------------------------------|
muni_id | 337 337 0 *|
year_month | 108 1 107 |
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation
---- CHECKPOINT: Log specification ----
Interpretation:
Coefficients now measure % changes in crime
High-income: 0.0001 (1 SD return → 0.01% change)
Low-income: -0.0004 (1 SD return → -0.04% change)
✅ Task 12.1 tests passed
The level specification asks: "Do returns cause a different number of crimes across income groups?" — and finds significant effects.
The log specification asks: "Do returns cause a different percentage change in crime across income groups?" — and finds nothing.
This happens because large municipalities drive the level results. In levels, a municipality with 5,000 monthly crimes contributes far more variance than one with 20 crimes. The log transformation compresses the scale and equalizes the influence of small and large municipalities — and the effect vanishes.
This tells us the effect is about absolute crime counts shifting in specific large municipalities, not about crime rates proportionally changing across all income groups. Keep this distinction in mind when interpreting your results and discussing limitations.
Task 12.2: Assault Crimes Only¶
Estimate the same model with `reghdfe`, `absorb(muni_id year_month)`, and `vce(cluster muni_id)`, but use `assault_crimes` as the dependent variable instead of `total_crimes`. Store the estimates as `m5_assault`. This tests whether the psychological mechanism primarily affects violent rather than property crime.
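A sketch of this check, reusing the Section 11 specification with the new outcome:

```stata
* Quick look at the violent-crime outcome first
summarize assault_crimes

* TWFE regression with assaults as the dependent variable
reghdfe assault_crimes ret_x_high ret_x_low, ///
    absorb(muni_id year_month) vce(cluster muni_id)
eststo m5_assault
```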
========================================
Robustness Check 2: Assault Crimes
========================================
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
assault_cr~s | 36,396 11.27756 28.59991 0 463
(MWFE estimator converged in 2 iterations)
HDFE Linear regression Number of obs = 36,396
Absorbing 2 HDFE groups F( 2, 336) = 1.41
Statistics robust to heteroskedasticity Prob > F = 0.2466
R-squared = 0.9677
Adj R-squared = 0.9672
Within R-sq. = 0.0001
Number of clusters (muni_id) = 337 Root MSE = 5.1757
(Std. err. adjusted for 337 clusters in muni_id)
------------------------------------------------------------------------------
| Robust
assault_cr~s | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
ret_x_high | -.0009849 .0071113 -0.14 0.890 -.0149731 .0130033
ret_x_low | .026294 .0166023 1.58 0.114 -.0063635 .0589514
_cons | 11.27207 .0042482 2653.40 0.000 11.26371 11.28042
------------------------------------------------------------------------------
Absorbed degrees of freedom:
-----------------------------------------------------+
Absorbed FE | Categories - Redundant = Num. Coefs |
-------------+---------------------------------------|
muni_id | 337 337 0 *|
year_month | 108 1 107 |
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation
---- CHECKPOINT: Assault crimes specification ----
Question: Does focusing on violent crime change results?
High-income effect: -.00098488
Low-income effect: .02629398
✓ Pattern holds for violent crime (assault)
✅ Task 12.2 tests passed
Task 12.3: Volatility Instead of Returns¶
Estimate `reghdfe total_crimes vol_high vol_low, absorb(muni_id year_month) vce(cluster muni_id)`. Store the estimates as `m6_volatility`. This tests whether the psychological effect comes from uncertainty rather than from changes in wealth.
- Returns (direction): Investors feel richer when market rises → better mood → less crime
- Volatility (uncertainty): Market turbulence creates stress/anxiety → worse mood → more crime
Expected pattern if volatility matters:
Higher volatility → more crime in high-income areas (investors stressed by uncertainty)
Variable used: volatility = within-month SD of daily returns (calculated in Section 3)
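A sketch of this check; the income-group dummy names `high_income` and `low_income` are assumptions (adapt them to the tercile indicators you created in Part A):

```stata
* Interact monthly volatility with the income-group indicators
* (dummy names below are assumptions; use your Part A variables)
gen vol_high = volatility * high_income
gen vol_low  = volatility * low_income

* TWFE regression with volatility interactions instead of returns
reghdfe total_crimes vol_high vol_low, ///
    absorb(muni_id year_month) vce(cluster muni_id)
eststo m6_volatility
```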
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
volatility | 36,396 .0094554 .0050592 .003573 .0422672
vol_high | 36,396 .0031425 .005324 0 .0422672
vol_low | 36,396 .0031705 .0053394 0 .0422672
(MWFE estimator converged in 2 iterations)
HDFE Linear regression Number of obs = 36,396
Absorbing 2 HDFE groups F( 2, 336) = 4.31
Statistics robust to heteroskedasticity Prob > F = 0.0142
R-squared = 0.9805
Adj R-squared = 0.9802
Within R-sq. = 0.0006
Number of clusters (muni_id) = 337 Root MSE = 79.9474
(Std. err. adjusted for 337 clusters in muni_id)
------------------------------------------------------------------------------
| Robust
total_crimes | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
vol_high | 56.42181 51.53063 1.09 0.274 -44.94149 157.7851
vol_low | -769.2239 298.4051 -2.58 0.010 -1356.201 -182.2463
_cons | 225.9745 .9781892 231.01 0.000 224.0504 227.8987
------------------------------------------------------------------------------
Absorbed degrees of freedom:
-----------------------------------------------------+
Absorbed FE | Categories - Redundant = Num. Coefs |
-------------+---------------------------------------|
muni_id | 337 337 0 *|
year_month | 108 1 107 |
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation
---- CHECKPOINT: Volatility specification ----
Question: Does market uncertainty (not direction) affect crime?
High-income: 56.42181
Low-income: -769.2239
High volatility → more crime in high-income areas (stress channel)
✅ Task 12.3 tests passed
If results are similar across specifications:
→ Findings are robust; not driven by specific modeling choices
→ Confidence in main result increases
If results differ substantially:
→ Effects may be heterogeneous (e.g., only for violent crime)
→ Or main specification had issues (e.g., skewed outcome needed logs)
→ Discuss why differences arise and which specification is preferred
If volatility matters more than returns:
→ Uncertainty/stress channel dominates wealth effect
→ Challenges Huck's specific mechanism
Common practice: Present main result (TWFE with levels) in text, show robustness in appendix table. If robustness checks reveal important heterogeneity, discuss in main text.
Section 13: Regression Table¶
We now create a publication-quality regression table combining all our main results. This table summarizes the entire empirical analysis in a format suitable for a thesis or research paper.
Good regression tables should:
- Show coefficients with standard errors (in parentheses)
- Use stars to indicate significance levels (*** p<0.01, ** p<0.05, * p<0.10)
- Report key model statistics (N, R², fixed effects)
- Have clear column headers describing each specification
- Be formatted for easy comparison across models
We will create a table with the following models:
- Column 1: Compound baseline (compound returns, one-way FE) — naive benchmark
- Column 2: Standardized baseline (standardized returns, one-way FE)
- Column 3: Interactions (standardized returns, one-way FE)
- Column 4: TWFE (standardized returns, municipality + time FE)
- Column 5: Log specification (TWFE)
- Column 6: Assault crimes (TWFE)
Column 1 uses raw compound returns as a naive benchmark. Columns 2–6 use standardized returns following Huck (2024), which account for time-varying volatility.
This table will be exported as LaTeX for inclusion in reports.
Task 13.1: Create Publication-Quality Regression Table¶
Use `esttab` to combine all stored estimates into a professional table. Export both to the screen (for review) and to LaTeX (for the thesis). Include coefficient estimates, standard errors, significance stars, and model statistics.
Useful `esttab` options:
- `se(3)` — show standard errors with 3 decimals
- `star(* 0.10 ** 0.05 *** 0.01)` — significance stars
- `mtitles("Model 1" "Model 2" ...)` — column headers
- `keep(varlist)` — which variables to show
- `order(varlist)` — row order
- `stats(N r2_a, labels("Observations" "Adjusted R²"))` — model statistics
- `label` — use variable labels instead of names
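Combining these options, the table step might look like the sketch below (the names of the first four stored models are assumptions; `m4_log` and `m5_assault` are the names used in Section 12, and `$tables` is the path macro defined in the setup section):

```stata
* Display on screen for review (model names before m4_log are assumptions)
esttab m1_compound m2_std m3_interact m3_twfe m4_log m5_assault, ///
    se(3) star(* 0.10 ** 0.05 *** 0.01) ///
    mtitles("Compound" "Std. Baseline" "Interactions" "TWFE" "Log" "Assault") ///
    stats(N r2_a, labels("Observations" "Adjusted R²")) label

* Export the same table as LaTeX for the thesis
esttab m1_compound m2_std m3_interact m3_twfe m4_log m5_assault ///
    using "$tables/main_results.tex", ///
    se(3) star(* 0.10 ** 0.05 *** 0.01) ///
    stats(N r2_a, labels("Observations" "Adjusted R²")) replace
```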
========================================
Main Regression Results Table
========================================
Table 1: Effect of Stock Returns on Crime by Income Group
-----------------------------------------------------------------------------------------
                       Compound  Std. Basel~e  Interactions      TWFE        Log  Assault
-----------------------------------------------------------------------------------------
Monthly compound r~n   33.72***
                       (5.596)
Monthly standardiz~m             0.665***
                                 (0.101)
Standardized retur~e                           1.307***      0.867***  -0.000392   0.0263
                                               (0.284)       (0.291)    (0.000)   (0.017)
Standardized retur~c                           0.440***
                                               (0.060)
Standardized retur~m                           0.243***     -0.197***   0.000119 -0.000985
                                               (0.025)       (0.065)    (0.001)   (0.007)
-----------------------------------------------------------------------------------------
Observations             36,396        36,396    36,396        36,396     36,396   36,396
Adjusted R²               0.000         0.001     0.001         0.980      0.967    0.967
-----------------------------------------------------------------------------------------
Standard errors in parentheses
* p<0.10, ** p<0.05, *** p<0.01
Notes for table:
• Dependent variable: Total crimes (Columns 1-4), Log(total crimes) (Column 5), Assault crimes (Column 6)
• Column 1 uses compound returns; Columns 2-6 use standardized returns (Huck 2024)
• All models include municipality fixed effects
• Columns 4-6 include time (year-month) fixed effects
• Standard errors clustered at municipality level in parentheses
• * p<0.10, ** p<0.05, *** p<0.01
(output written to /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/06-assignment/output/tables/main_results.tex)
---- CHECKPOINT: Regression table created ----
✅ Table exported to: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/06-assignment/output/tables/main_results.tex
Models included: 6 (Compound baseline through Assault)
✅ Task 13.1 tests passed
If you want to create a separate volatility table, use:
`esttab m6_volatility using "$tables/volatility_results.tex"`
This keeps your main results table clean and focused on the return → crime relationship.
Column progression (left to right):
What to look for:
For your thesis: This table would go in the main body. The volatility table (if created) would go in an appendix. Always accompany tables with 2-3 paragraphs of interpretation in text.
Section 14: Interpretation and Limitations (Ungraded and voluntary)¶
The empirical analysis is complete. Now we interpret the findings and honestly discuss limitations.
Good empirical papers have three key elements:
- Clear interpretation of what the results mean economically and statistically
- Honest discussion of limitations (what the study cannot test or conclude)
- Constructive suggestions for how future work could improve identification
Why limitations matter:
- Prevents over-claiming (saying more than the data can support)
- Shows you understand the research design deeply
- Guides readers on how to interpret your findings
- In your thesis, examiners expect thoughtful discussion of what you cannot claim
This section has three writing tasks that synthesize everything you've learned.
Task 14.1: Summarize Your Findings¶
- Main result: Do you find differential effects of returns across income groups?
- Signs: Are the signs consistent with Huck (2024)'s theory? (High negative, low positive?)
- Statistical significance: Are the key coefficients statistically significant?
- Economic magnitude: How large are the effects? (e.g., "A 1 SD positive return reduces crime by X incidents in high-income areas")
- Robustness: Do results hold across specifications (TWFE, log, assault)?
- Comparison to Huck: Are your findings consistent with or different from the US results?
Your findings summary (minimum 200 words):
(Write your answer below)
Task 14.2: Discuss Limitations¶
- Explain what the limitation is
- Why it matters (what it prevents you from concluding)
- How it differs from the original Huck (2024) study
Suggested limitations to consider:
What NOT to do:
What TO do:
Your limitations discussion (minimum 150 words, at least 3 points):
(Write your answer below)
Limitation 1:
Limitation 2:
Limitation 3:
Task 14.3: What Would Strengthen This Analysis?¶
- Better data: What data would you ideally have?
- Natural experiments: Are there shocks you could exploit for causal identification?
- Alternative mechanisms: What else could you test?
- Extensions: How could you expand this to related questions?
Examples to consider:
Your suggestions for improvement (minimum 100 words, at least 3 ideas):
(Write your answer below)
Improvement 1:
Improvement 2:
Improvement 3:
Part B Summary¶
Congratulations! You have completed Part B: Analysis.
What you accomplished:
- ✅ Section 7: Created descriptive statistics by income tercile
- ✅ Section 8: Visualized crime and return trends over time
- ✅ Section 9: Estimated baseline panel regressions
- ✅ Section 10: Tested differential effects by income (core hypothesis)
- ✅ Section 11: Added two-way fixed effects for better identification
- ✅ Section 12: Conducted robustness checks (log, assault, volatility)
- ✅ Section 13: Created publication-quality regression table
- ✅ Section 14: Interpreted findings and discussed limitations honestly
Outputs created:
- Tables: `summary_stats_panelA.tex`, `summary_stats_panelB.tex`, `main_results.tex`
- Figures: `crime_timeseries.png`, `crime_returns_overlay.png`, `crime_by_income.png`
- Stored estimates: 6 regression models ready for comparison
Skills practiced:
- Panel data regression with fixed effects
- Testing interaction effects
- Two-way fixed effects estimation
- Robustness checks
- Creating professional tables and figures
- Interpreting results and discussing limitations
Your thesis Module 2/3 will look very similar:
- Part A (Data Preparation): 60-70% of your time
- Part B (Analysis): 20-30% of your time
- Part C (Writing/Interpretation): 10% of your time
The skills you practiced here—data wrangling, regression analysis, robustness checks, honest limitation discussion—are exactly what you'll need for your MSc thesis.
Part C: Thesis Connection and Reflection¶
This final section connects your work to the MSc thesis and reflects on what you've learned.
Section 15: Thesis Module 2/3 Connection¶
Your MSc Finance thesis has three modules:
- Module 1: Literature review and research question
- Module 2: Replication of published results with new/similar data
- Module 3: Extension or new analysis building on Module 2
This assignment mirrors the Module 2/3 workflow. You've essentially completed a mini-thesis:
- Read a published paper (Huck 2024)
- Replicated the methodology with new data (Netherlands instead of US)
- Tested whether findings generalize to a new setting
Section 16: Conclusion¶
You have completed a full replication/extension exercise from start to finish.
What you accomplished:
Part A (Data Preparation):
- Loaded and examined three raw datasets
- Calculated monthly returns following published methodology
- Reshaped data from long to wide format
- Merged multiple datasets into a panel
- Created analysis variables (terciles, interactions)
Part B (Analysis):
- Created descriptive statistics and visualizations
- Estimated baseline and interaction regressions
- Tested differential effects across income groups (Huck's hypothesis)
- Added two-way fixed effects for better identification
- Conducted robustness checks
- Created publication-quality tables
- Interpreted results and discussed limitations honestly
Key lessons:
Data preparation is 60-80% of empirical work — You spent most time in Part A, not Part B. This is normal and expected.
Published papers use better data than you'll have — Huck has daily crime data and actual investor accounts. You have monthly aggregates and income proxies. This is the reality of replication work.
Honest limitation discussion is a strength — Acknowledging what you cannot test shows deep understanding of research design.
Even null results are informative — If you don't find Huck's pattern in NL data, that's a contribution (generalizability is limited).
Extensions matter more than perfect replication — Your thesis should test something new, even if it's a simple extension.
For your MSc thesis:
- Start data preparation early (it always takes longer than expected)
- Be realistic about what you can test with available data
- Focus on a simple, clear extension rather than perfect replication
- Discuss limitations honestly (examiners value this highly)
Final thought:
Research is about answering questions with imperfect data and honest acknowledgment of limitations. You've practiced exactly that in this assignment. These skills—data wrangling, regression analysis, robustness testing, limitation discussion—will serve you well in your thesis and beyond.
Good luck with your MSc thesis! 🎓
Data Analytics for Finance
BM17FI · Academic Year 2025–26
Created by: Caspar David Peter
© 2026 Rotterdam School of Management