Data Analytics for Finance
| Assignment | Release | Deadline | Graded |
|---|---|---|---|
| A1 | 11.02.2026 | 25.02.2026 | \(\checkmark\) |
| A2 | 11.02.2026 | 25.02.2026 | \(\checkmark\) |
| A3 | 18.02.2026 | 04.03.2026 | \(\checkmark\) |
| A4 | 25.02.2026 | 11.03.2026 | |
| A5 | 04.03.2026 | 18.03.2026 | |
| A6 | 11.03.2026 | 25.03.2026 | |
Assignment 6!
A6 is the replication exercise. You will replicate Huck (2024) using Dutch data. Everything we discuss today feeds directly into it. Deadline: March 25, 2026.
Why this matters
You are about to do exactly what thesis Module 2 asks: read a published paper, replicate the methodology with new data, and evaluate whether findings generalize.
Each lecture added a layer to your empirical toolkit. Today we see how they work together in a single published paper and in your own replication.
Remember this? merge, reshape, collapse, gen/replace
In every assignment, you spent 60–70% of your time on data preparation. That’s research. Facts!
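If you replicate in Python, those Stata verbs carry over almost one-to-one to pandas. A minimal sketch with entirely hypothetical toy data (column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy sources (hypothetical): crime counts and a municipality income lookup
crime = pd.DataFrame({"gem": ["A", "A", "B", "B"],
                      "month": [1, 2, 1, 2],
                      "crimes": [10, 12, 5, 7]})
income = pd.DataFrame({"gem": ["A", "B"], "income": ["high", "low"]})

# merge (Stata)        -> DataFrame.merge; validate="m:1" guards the key structure
panel = crime.merge(income, on="gem", how="left", validate="m:1")

# gen / replace (Stata) -> column assignment
panel["log_crimes"] = np.log(panel["crimes"])

# collapse (Stata)     -> groupby + aggregation
by_gem = panel.groupby("gem", as_index=False)["crimes"].mean()

# reshape wide (Stata) -> pivot
wide = panel.pivot(index="gem", columns="month", values="crimes")
```

The `validate` argument is worth the habit: it fails loudly when a merge key you believed was unique is not, which is exactly the kind of silent data-prep bug that eats those 60–70% hours.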
The principle
If your regression tells you something your visualization doesn’t support, trust the visualization and debug the regression.
OVB doesn’t just make estimates imprecise, it can flip the sign!
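A small simulation (entirely hypothetical data) makes the sign flip concrete: the true effect of x is +1, but omitting the confounder z pushes the short-regression estimate below zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)                       # unobserved confounder
x = 1.5 * z + rng.normal(size=n)             # x is correlated with z
y = 1.0 * x - 3.0 * z + rng.normal(size=n)   # true beta_x = +1

# Long regression (controls for z): recovers roughly +1
X_long = np.column_stack([np.ones(n), x, z])
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0]

# Short regression (omits z): bias = gamma * Cov(x,z)/Var(x) = -3 * 0.46 ≈ -1.38,
# so the estimated coefficient is about 1 - 1.38 ≈ -0.4 — the sign flips
X_short = np.column_stack([np.ones(n), x])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

print(b_long[1], b_short[1])
```

The bias formula in the comment is the standard OVB algebra: short coefficient = true coefficient + (effect of omitted variable) × (regression of omitted on included).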
The intuition
If you can’t run an experiment and can’t observe all confounders, panel methods let you difference them out.
| Research question type | Data structure | Method | Lecture |
|---|---|---|---|
| Does X predict Y? | Cross-section | OLS with controls | L3 |
| Does X cause Y? (observable confounders) | Cross-section | OLS + careful controls | L3 |
| Does X cause Y? (unobservable confounders) | Panel | Fixed effects, DiD | L4 |
| How did the market react to event E? | Daily returns | Event study (CARs) | L5 |
| Does characteristic Z earn a risk premium? | Panel of returns | Fama–MacBeth | L5 |
| Does this US result hold in the Netherlands? | Same spec | Replication | L6 |
Key insight
The method follows from the question and the data structure, not the other way around.
| Lecture | Method | Identifies | Data structure | Key assumption |
|---|---|---|---|---|
| L3 | OLS + controls | Cross-sectional relationships | Cross-section | No omitted variables |
| L4 | DiD / Panel FE | Causal effect of treatment | Panel | Parallel trends |
| L5 | Event study | Market reaction to specific events | Daily panel | Market efficiency + correct factor model |
| L5 | Fama–MacBeth | Systematic return predictability | Repeated cross-sections | Cross-sections are independent draws |
Key insight
The art of empirical finance is choosing the method whose assumptions are most credible for your specific research question and data.
The standard error story has been building across the course: heteroskedasticity-robust standard errors (L3), clustered standard errors for panel data (L4), and Fama–MacBeth for repeated cross-sections (L5).
All three address the same fundamental issue: when residuals are not independent, naive standard errors are too small. The solution is always to match the inference method to the correlation structure of the data.
The common thread
The choice of standard error is not a technicality, it is a statement about what you believe is independent in your data. Getting it wrong can flip your conclusions from significant to insignificant (or vice versa).
Huck (2024) — “The Psychological Externalities of Investing: Evidence from Stock Returns and Crime”
The Review of Financial Studies, 37(4), 1172–1213.
Do stock market returns affect the psychological well-being of both investors and noninvestors?
Precedent
Card and Dahl (2011) use intimate partner violence as a proxy for emotional shocks from NFL game outcomes. Same logic: if people feel worse, some act out in ways that show up in the data.
The contrasting predictions are what make this interesting — it’s not just “does X cause Y” but “does X cause Y differently for subgroups?”
This is consistent with relative status models (Abel 1990; Gali 1994) and the “keeping up with the Joneses” literature.
Link to L1–L2
This is the data pipeline from Lectures 1–2. Three raw sources ⤏ clean ⤏ merge ⤏ construct variables. Huck devotes a substantial part of the paper to describing these choices. Remember when 60–70% of your assignment time was data prep?
| | Average | Median | SD |
|---|---|---|---|
| Crime rate (all) | 22,659 | 17,675 | 18,565 |
| High-income | 18,347 | 14,547 | 14,577 |
| Medium-income | 24,215 | 19,618 | 18,894 |
| Low-income | 27,043 | 21,527 | 21,784 |
| Violent crime rate | 4,676 | 2,437 | 7,140 |
| High-income | 3,370 | 1,524 | 5,455 |
| Medium-income | 5,077 | 3,021 | 7,199 |
| Low-income | 6,093 | 3,637 | 8,748 |
| Market return (std.) | 0.038 | 0.082 | 1.044 |
\[y_{i,t} = \beta \cdot r_t + \gamma \cdot X_{i,t} + \theta_{i,a(t)} + \mu_{i,m(t)} + \omega_{i,w(t)} + \delta_{i,d(t)} + T_t + \varepsilon_{i,t}\]
| | What it is | What it absorbs |
|---|---|---|
| \(y_{i,t}\) | Crime rate for agency \(i\) on day \(t\) | Outcome variable |
| \(r_t\) | Standardized daily market return | Coefficient of interest |
| \(X_{i,t}\) | Weather, pollution, sports outcomes, celestial | Daily confounders |
| \(\theta_{i,a(t)}\) | Agency × year FE | Local business cycles, demographics |
| \(\mu_{i,m(t)}\) | Agency × month-of-year FE | Local seasonal crime patterns |
| \(\omega_{i,w(t)}\) | Agency × week-of-month FE | Payday effects |
| \(\delta_{i,d(t)}\) | Agency × day-of-week FE | Weekend effects |
| \(T_t\) | Turn-of-month + holiday FE | Calendar anomalies |
Standard errors clustered by time and agency.
The concern: maybe something else that happens on the same day drives both stock returns and crime (e.g., weather, sports, holidays).
Identification occurs within each agency-year. After removing all these patterns, the remaining variation in crime is compared with daily market return fluctuations.
Link to L4
This is the fixed effects logic taken seriously. Each layer of FE removes another source of confounding. The residual variation is what identifies the effect.
The model is estimated separately for high-income (investor) and low-income (noninvestor) locations:
| | High-income | Low-income |
|---|---|---|
| Market return | −12.47*** | +15.22** |
| | (−3.291) | (2.043) |
| % of avg | −0.37% | +0.25% |
| t(HIGH=LOW) | 3.313 | |
Key takeaway
The stock market has real externalities beyond the portfolios it directly affects. This is what makes it a finance paper with social implications.
| | t−3 | t−2 | t−1 | t | t+1 | t+2 | t+3 |
|---|---|---|---|---|---|---|---|
| High | 1.2 | 6.1 | −6.4 | −12.5*** | −3.0 | −0.1 | 3.8 |
| Low | 6.4 | −9.1 | −3.8 | 15.2** | 5.0 | 4.7 | 2.1 |
Only the contemporaneous (day \(t\)) coefficients are significant!
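In a replication, building lead/lag return columns like these is one `shift` per horizon. A pandas sketch (column names hypothetical):

```python
import pandas as pd

# Toy daily return series (hypothetical values)
ret = pd.DataFrame({"date": pd.date_range("2020-01-01", periods=6, freq="D"),
                    "r": [0.1, -0.2, 0.3, 0.0, 0.5, -0.1]})

for k in range(1, 4):
    ret[f"r_lead{k}"] = ret["r"].shift(-k)   # return k days ahead (placebo)
    ret[f"r_lag{k}"] = ret["r"].shift(k)     # return k days back
```

Merging these columns onto the outcome panel and including them all in one regression is what produces a leads/lags table of this shape; the leads act as placebos that should be insignificant.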
Good empirical work doesn’t just show one result, it shows the result survives scrutiny.
Each robustness check addresses a specific threat to identification:
Positive local earnings surprises show the same contrasting pattern — even after removing all common daily variation.
| Paper element | Skill/Tool | You learned this in |
|---|---|---|
| Three data sources ⤏ merged panel | Data wrangling, merge, reshape | L1 |
| Summary statistics, time series plots | EDA, distributional analysis | L2 |
| OLS regression, interpreting coefficients | OLS, hypothesis testing | L3 |
| Agency × year FE, agency × seasonal FE | Panel fixed effects, TWFE | L4 |
| Leads/lags analysis (event window logic) | Event study timing | L5 |
| Interaction terms (return × income group) | Differential effects, DiD intuition | L4 |
| Clustered standard errors | Correct inference | L3–L5 |
| Robustness to alternative specifications | Defending identification | L3–L5 |
The point
You have the entire toolkit to read, understand, and replicate this paper. That’s what you’ll do in Assignment 6.
Important difference
Monthly frequency is the biggest limitation: we cannot test Huck’s timing analysis (Tables 4–5) or identify same-day effects. You’ll discuss this in Section 14.
Remember: Part A is 60–70% of the work. That’s normal.
Before regressions, you create:
Link to L2
This is the “look before you regress” principle. Your descriptive statistics should reveal patterns that your regressions will formalize.
The baseline model estimates the average effect of AEX returns on crime across all municipalities:
\[\text{Crime}_{it} = \beta \cdot r_t^{AEX} + \alpha_i + \varepsilon_{i,t}\]
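With one set of fixed effects, estimation is just the within transformation: demean both sides by municipality and run OLS on the residuals. A sketch on simulated data (variable names hypothetical; the true coefficient is set to 2.0):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
munis, months = 20, 24
df = pd.DataFrame({
    "muni": np.repeat(np.arange(munis), months),
    "r_aex": np.tile(rng.normal(size=months), munis),  # one market return per month
})
alpha = np.repeat(rng.normal(size=munis), months)       # municipality FE
df["crime"] = 2.0 * df["r_aex"] + alpha + rng.normal(size=len(df))

# Within transformation: subtract municipality means from outcome and regressor
yd = df["crime"] - df.groupby("muni")["crime"].transform("mean")
xd = df["r_aex"] - df.groupby("muni")["r_aex"].transform("mean")
beta_within = (xd @ yd) / (xd @ xd)

print(beta_within)  # close to the true value 2.0
```

This is numerically equivalent to including a full set of municipality dummies, which is why "fixed effects" and "demeaning" are used interchangeably.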
\[\text{Crime}_{it} = \beta_0 + \beta_1 \text{Returns}_t + \beta_2 (\text{Returns}_t \times \text{High}_{i}) + \beta_3 (\text{Returns}_t \times \text{Low}_{i}) + \alpha_i + \varepsilon_{it}\]
Where \(\text{High}_i\) and \(\text{Low}_i\) are indicators for high- and low-income municipalities (medium-income is the omitted base group), \(\text{Returns}_t\) is the monthly AEX return, and \(\alpha_i\) are municipality fixed effects.
Key test: Is \(\beta_2 < 0\) (high-income negative) and \(\beta_3 > 0\) (low-income positive)?
\[\text{Crime}_{it} = \beta_2 (\text{Returns}_t \times \text{High}_{i}) + \beta_3 (\text{Returns}_t \times \text{Low}_{i}) + \alpha_i + \gamma_t + \varepsilon_{it}\]
Where \(\gamma_t\) are month (time) fixed effects. Because every municipality faces the same AEX return in a given month, \(\gamma_t\) absorbs the main return effect, so only the interaction coefficients are identified.
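Since \(r_t\) is constant across municipalities, it is an exact linear combination of the month dummies. A quick rank check on toy data confirms it is absorbed:

```python
import numpy as np

rng = np.random.default_rng(3)
munis, months = 5, 12
r = rng.normal(size=months)                 # one market return per month

# Stack municipality-month rows
month_id = np.tile(np.arange(months), munis)
returns_col = r[month_id]                   # same return for every municipality
time_dummies = np.eye(months)[month_id]     # month fixed-effect dummies

# Adding the returns column does not increase the column rank:
rank_fe = np.linalg.matrix_rank(time_dummies)
rank_both = np.linalg.matrix_rank(np.column_stack([time_dummies, returns_col]))
print(rank_fe, rank_both)  # equal, so r_t is perfectly collinear with the dummies
```

Any software would silently drop either \(r_t\) or one dummy here; the interactions survive because they vary both across municipalities and over time.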
Section 13 asks you to create a single table combining all specifications:
🤦♀️ Just listing coefficients without interpretation
🤦♀️ Over-claiming causality (“AEX returns cause crime”)
🤦♀️ Ignoring null results (“we didn’t find it, so we won’t discuss it”)
💰 Null results are informative! If you don’t find the same pattern, that’s a finding worth discussing.
Think about why:
For your thesis
A well-explained null finding shows deeper understanding than a weak positive with no discussion.
“Monthly data cannot test within-day timing effects (Huck’s Table 4–5), limiting our ability to distinguish same-day from lagged effects.”
“Municipality-level income is an imperfect proxy for investor participation; Huck uses census micro-data allowing for more precise classification.”
“A single market return means no cross-sectional variation in ‘treatment’ within a given month — all municipalities face the same AEX return.”
“We don’t have enough data” ⤏ Too vague. What data? Why does it matter?
“Results might be wrong” ⤏ Undermines everything without explaining what could be wrong.
“The Netherlands is different from the US” ⤏ How is it different, and why does that affect your specific test?
| Limitation | Consequence | vs. Huck |
|---|---|---|
| Monthly frequency | Cannot identify same-day effects; timing tests impossible | Huck uses daily + hourly data |
| Single market return | No cross-sectional treatment variation within months | Huck also has local earnings surprises |
| Ecological fallacy | Municipality income ≠ individual investor status | Same issue, but Huck has finer geography |
| Standard errors | Few time periods (~96 months), spatial correlation | Huck has ~6,000 trading days |
| Crime reporting | Measurement varies by municipality and crime type | NIBRS is standardized across agencies |
| Selection/attrition | Municipality mergers change boundaries over time | Huck filters for consistent reporters |
The direct connection
Assignment 6 is Module 2. Your improvement suggestions in Section 14.3 are Module 3 ideas. The skills you practiced — data wrangling, panel regression, robustness, interpretation — are exactly what your thesis demands.
“If you can’t explain why you’re including municipality fixed effects, the LLM can’t save you in your thesis defense.”
The value chain
Understanding ⤏ Design ⤏ Implementation. AI helps most with the last step, least with the first. This course focused on the first two — because those are what make you a researcher, not a code typist.
If you can’t describe your data structure, you’re not ready to estimate anything. Invest time in Part A.
Get the baseline working before you try new specifications. Confirm that your data pipeline produces sensible numbers.
Every published paper has limitations. Acknowledging them shows you understand the research design. Not acknowledging them suggests you don’t.
Thank You!
It’s been a pleasure. Good luck with Assignment 6 and your thesis!