Data Analytics for Finance
| Assignment | Release | Deadline | Graded |
|---|---|---|---|
| A1 | 11.02.2026 | 25.02.2026 | \(\checkmark\) |
| A2 | 11.02.2026 | 25.02.2026 | \(\checkmark\) |
| A3 | 18.02.2026 | 04.03.2026 | |
| A4 | 25.02.2026 | 11.03.2026 | |
| A5 | 04.03.2026 | 18.03.2026 | |
| A6 | 11.03.2026 | 25.03.2026 | |
Assignment 5 Is Out!
A5 asks you to implement a full event study on the Dieselgate data. Everything we cover today is directly relevant. Deadline: March 18, 2026.
Today we add two specialized tools for finance research: event studies for measuring market reactions to specific events, and Fama–MacBeth regressions for testing whether firm characteristics systematically predict returns over time.
| Method | Best for | Identification | Time scale |
|---|---|---|---|
| OLS (L3) | Cross-sectional relationships | Controls + research design | Cross-section |
| DiD / Panel FE (L4) | Policy changes, treatment rollouts | Parallel trends + FE | Months–years |
| Event study (L5) | Market reactions to specific events | Market efficiency + factor model | Days |
| Fama–MacBeth (L5) | Systematic return predictability | Repeated cross-sections | Months–years |
Each method makes different assumptions to construct the counterfactual. No single method dominates — the right choice depends on the research question and data structure.
Event studies rely on a version of the semi-strong form of market efficiency (Fama 1970):
Stock prices quickly incorporate all publicly available information. When new information arrives, prices adjust rapidly — abnormal returns are concentrated in a short window around the event.
Identification assumption
The event study identification assumption is analogous to parallel trends in DiD. In DiD we assume: absent treatment, treated and control groups would have followed the same trend. In event studies we assume: absent the event, the stock would have earned its expected return from the factor model.
“What was the stock market reaction to the Dieselgate scandal announcement on September 18, 2015?”
On this date, the U.S. Environmental Protection Agency (EPA) issued a Notice of Violation to Volkswagen for using illegal “defeat devices” in diesel engines. The question is whether — and how much — investors punished German automakers relative to non-German competitors.
The simplest expected return model:
\[R_{i,t} = \alpha_i + \beta_i R_{m,t} + \varepsilon_{i,t}\]
This is just OLS! We regress firm returns on market returns using only the estimation window.
If VW stock typically moves 1.2% for every 1% move in the DAX (\(\hat{\beta} = 1.2\)), and earns an extra 0.01% per day on average (\(\hat{\alpha} = 0.0001\)), then on any given day we can predict what VW “should” have returned — absent any firm-specific news.
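As a minimal sketch, the market-model estimation is a single least-squares fit. The returns below are simulated stand-ins for real VW/DAX data, with the true parameters set to the values from the example above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated estimation window: 250 trading days (illustrative, not real data)
r_m = rng.normal(0.0004, 0.01, 250)                    # market (e.g. DAX) daily returns
r_i = 0.0001 + 1.2 * r_m + rng.normal(0, 0.008, 250)   # firm returns, true beta = 1.2

# Market model via OLS: regress firm returns on a constant and market returns
X = np.column_stack([np.ones_like(r_m), r_m])
alpha_hat, beta_hat = np.linalg.lstsq(X, r_i, rcond=None)[0]

print(round(alpha_hat, 4), round(beta_hat, 2))
```

With 250 daily observations, the beta estimate lands close to the true value of 1.2; the alpha is tiny and noisy, which is typical for daily data.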
| Model | Formula | Pros | Cons |
|---|---|---|---|
| Market Model | \(R_i = \alpha + \beta R_m + \varepsilon\) | Simple, intuitive, widely used | Ignores size/value effects |
| CAPM | \(R_i - R_f = \beta (R_m - R_f) + \varepsilon\) | Theoretical foundation | Constrains \(\alpha = 0\) |
| Fama–French 3-Factor | \(R_i - R_f = \alpha + \beta_{mkt}(R_m - R_f) + \beta_{smb}SMB + \beta_{hml}HML + \varepsilon\) | Controls for size and value | More parameters to estimate |
For short-window studies, the model choice typically does not matter much. We will verify this by comparing Market Model and FF3 results in A5.
\[R_{i,t} - R_{f,t} = \alpha_i + \beta_{mkt}(R_{m,t} - R_{f,t}) + \beta_{smb}\, SMB_t + \beta_{hml}\, HML_t + \varepsilon_{i,t}\]
If German automakers are systematically large, value firms, the Market Model may attribute some of their normal return to alpha when it is actually compensation for size and value exposure. FF3 controls for this — reducing potential misattribution bias in the abnormal return calculation.
\[AR_{i,t} = R_{i,t} - \hat{E}[R_{i,t}]\]
In the estimation window, the mean abnormal return should be approximately zero — because we estimated the model on that data. If \(\bar{AR} \neq 0\) in the estimation window, something has gone wrong. This is the event study equivalent of checking OLS residuals.
\[AR_{i,t}^{MM} = R_{i,t} - \hat{\alpha}_i - \hat{\beta}_i R_{m,t}\]
\[AR_{i,t}^{FF3} = R_{i,t} - R_{f,t} - \hat{\alpha}_i - \hat{\beta}_{mkt}(R_{m,t} - R_{f,t}) - \hat{\beta}_{smb}\, SMB_t - \hat{\beta}_{hml}\, HML_t\]
On September 18, 2015, VW’s actual return was far below what its beta and the market return would predict. The large negative \(AR\) is the market’s verdict on Dieselgate.
\[CAR_i[0, +5] = \sum_{t=0}^{5} AR_{i,t}\]
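Given fitted market-model parameters, ARs and the CAR over [0, +5] follow directly. A sketch with invented event-window shocks (not actual VW returns):

```python
import numpy as np

rng = np.random.default_rng(1)

# Market-model parameters fitted on the estimation window (assumed values)
alpha_hat, beta_hat = 0.0001, 1.2

# Event window [0, +5]: simulated market returns plus invented firm-specific shocks
r_m_event = rng.normal(0.0, 0.01, 6)
shocks = np.array([-0.18, -0.05, -0.02, 0.01, -0.03, 0.0])   # hypothetical news impact
r_i_event = alpha_hat + beta_hat * r_m_event + shocks

# Abnormal return: actual return minus the model-predicted (counterfactual) return
ar = r_i_event - (alpha_hat + beta_hat * r_m_event)

# CAR[0, +5]: sum of abnormal returns over the event window
car = ar.sum()
print(round(car, 2))   # -0.27
```

Because the model-predicted component cancels out, the CAR here recovers exactly the sum of the firm-specific shocks, which is the point: the factor model strips out market-driven variation.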
Connection to A3
In Assignment 3, you used pre-calculated CARs as dependent variables in cross-sectional regressions. Now you understand where those numbers came from — and in A5 you will calculate them yourself.
This is the visual equivalent of the DiD plot you saw in L4 — but at daily frequency with the “treatment” being the scandal announcement.
Once you have CARs, you need to test whether they are statistically significant:
Small sample warning
With only 3 German automakers (after dropping VW subsidiaries), t-tests have low statistical power. Interpret results in terms of economic magnitude alongside statistical significance.
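A sketch of the cross-sectional significance test with hypothetical CAR values (the numbers are invented; with \(N = 3\) the test is underpowered by construction):

```python
import numpy as np
from scipy import stats

# Hypothetical CAR[0, +5] values for three German automakers (illustrative only)
cars_german = np.array([-0.27, -0.12, -0.09])

# One-sample t-test: is the mean CAR significantly different from zero?
t_stat, p_val = stats.ttest_1samp(cars_german, 0.0)
print(round(t_stat, 2), round(p_val, 3))
```

Note how an economically large average CAR (about -16%) can still fail conventional significance thresholds with only two degrees of freedom. This is exactly why the warning above says to report economic magnitude alongside p-values.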
After computing CARs, we can ask: what firm characteristics predict larger (or smaller) reactions?
\[CAR_i[0, +5] = \gamma_0 + \gamma_1 \text{German}_i + \gamma_2 \text{Size}_i + \gamma_3 \text{ROA}_i + \gamma_4 \text{Leverage}_i + u_i\]
This is exactly what you did in A3 — but now you understand the full pipeline from raw returns to CARs to cross-sectional tests.
| Aspect | Key point |
|---|---|
| When to use | Market reactions to specific, well-dated events |
| Key assumption | Market efficiency — prices reflect public information quickly |
| Identification | Expected return model provides the counterfactual |
| Strengths | Precise timing, clean identification in short windows |
| Limitations | Single events, small N, model dependence for longer horizons |
| In A5 | You implement the full pipeline on Dieselgate data |
The classic event study asks: “What would this stock have returned absent the event?”
That is fundamentally a counterfactual question — the same question DiD asks, just at daily frequency. Goldsmith-Pinkham and Lyu (2025) make this connection explicit:
The traditional abnormal return \(AR_{i,t} = R_{i,t} - \hat{E}[R_{i,t}]\) is really an estimate of the average treatment effect on the treated — where the “treatment” is the event and the “counterfactual” comes from the factor model.
Same revolution, different setting
Two dimensions determine how much trouble you face:
| | Short-run (days) | Long-run (months/years) |
|---|---|---|
| Many events, random timing | Model choice barely matters ✅ | Model choice matters ⚠️ |
| Single event | It depends ⚠️ | Almost always biased ❌ |
Acemoglu et al. (2016) study firms connected to Timothy Geithner around his nomination as Treasury Secretary (November 21, 2008):
Instead of assuming you know the right factor model, let the data construct the counterfactual:
| Approach | Key idea | Connection |
|---|---|---|
| Synthetic control | Weighted portfolio of control firms that matches pre-event returns | Same as Abadie’s SCM from policy evaluation |
| GSynth / PCA regression | Data-driven factor structure estimated from the pre-event period | Lets the data find the factors instead of assuming them |
| Synthetic DiD | Combines synthetic control weighting with DiD structure | Bridges the two literatures |
Practical guidance
The good news: A5 uses a short event window [0,+5] and compares two benchmark models. If both tell the same story, your results are robust to the concerns raised by this new literature.
Event studies ask: “Did this specific event move stock prices?”
A different question: “Do certain firm characteristics systematically predict returns across time?”
When you pool all firm-month observations and run OLS, you implicitly assume residuals are independent across firms within the same period. But…
The consequence
Pooled OLS dramatically understates standard errors, leading to inflated t-statistics and false conclusions about significance. This is similar to the Moulton problem from L4 — but the correlation is cross-sectional (across firms in the same period) rather than within-group over time.
Fama and MacBeth (1973) proposed an elegant two-step solution:
Collect T slope coefficients \(\hat{\gamma}_{1,1}, \hat{\gamma}_{1,2}, \ldots, \hat{\gamma}_{1,T}\)
\(\bar{\gamma}_1 = \frac{1}{T} \sum_{t=1}^{T} \hat{\gamma}_{1,t} \quad \text{and} \quad SE(\bar{\gamma}_1) = \frac{s(\hat{\gamma}_{1,t})}{\sqrt{T}}\)
The key insight
Each cross-sectional regression produces one estimate per period. By treating these as a time series, you automatically account for cross-sectional correlation — because each \(\hat{\gamma}_{1,t}\) already incorporates whatever common shock happened in period \(t\).
For each time period \(t\) (e.g. each month):
\[R_{i,t} = \gamma_{0,t} + \gamma_{1,t}\, X_{i,t-1} + \varepsilon_{i,t}\]
Imagine 3–4 scatter plots side by side, each showing one month’s cross-section: firm returns on the y-axis vs. a characteristic (e.g. ESG score) on the x-axis. Each scatter plot has its own fitted line. The slopes vary from month to month — some steep, some flat, some negative. FM collects all these slopes and asks: on average, is the slope positive?
\[\bar{\gamma}_1 = \frac{1}{T} \sum_{t=1}^{T} \hat{\gamma}_{1,t}\]
The average slope across all periods is our estimate of the systematic relationship between the characteristic and returns — the risk premium, if you are thinking in asset pricing terms.
\[SE(\bar{\gamma}_1) = \frac{s(\hat{\gamma}_{1,t})}{\sqrt{T}}\]
where \(s(\hat{\gamma}_{1,t})\) is the standard deviation of the \(T\) estimated slopes.
Why this solves the problem
Pooled OLS treats each firm-month as independent → too many “observations.” FM recognizes that there are only \(T\) independent cross-sections. The time-series variation in \(\hat{\gamma}_{1,t}\) reflects the true uncertainty about the average relationship. If the slope is stable over time, SE is small; if it bounces around, SE is large.
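The two-step procedure can be sketched on simulated panel data (the true slope of 0.003 and the shock magnitudes are assumptions of the simulation, not empirical estimates):

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 60, 100   # 60 months, 100 firms per cross-section

gammas = []
for t in range(T):
    # One month's cross-section: lagged characteristic and returns
    x = rng.normal(0, 1, N)            # firm characteristic X_{i,t-1}
    common = rng.normal(0, 0.02)       # common shock hitting all firms in month t
    r = 0.005 + 0.003 * x + common + rng.normal(0, 0.05, N)

    # Step 1: cross-sectional OLS for this month; keep only the slope
    X = np.column_stack([np.ones(N), x])
    gammas.append(np.linalg.lstsq(X, r, rcond=None)[0][1])

gammas = np.array(gammas)

# Step 2: average the T slopes; SE comes from their time-series variation
gamma_bar = gammas.mean()
se = gammas.std(ddof=1) / np.sqrt(T)
print(round(gamma_bar, 4), round(se, 4))
```

The common shock moves every firm's return in month \(t\) but not the month-\(t\) slope, so the time-series of slopes already has the cross-sectional correlation "baked out" — exactly the key insight stated above.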
This distribution is the core diagnostic of Fama–MacBeth. It shows you not just the average effect, but how consistent it is over time.
| Aspect | Pooled OLS | Panel FE (L4) | Fama–MacBeth |
|---|---|---|---|
| Problem addressed | — | Time-invariant confounders | Cross-sectional correlation of residuals |
| How | Single regression on all data | Entity & time fixed effects | Repeated cross-sectional regressions |
| SE adjustment | None (or robust/clustered) | Cluster by entity or time | Time-series variation of coefficients |
| Typical use | Baseline specification | Corporate finance panels | Asset pricing, return predictability |
| Assumes | Independence | Parallel trends, no time-varying confounders | Stable cross-sectional relationship |
| Risk | Overstated significance | Absorbs too much variation | Imprecise first-stage estimates (EIV) |
Petersen (2009) shows that FM and clustering by time address the same problem (cross-sectional correlation). In practice, double-clustering (by firm and time) is a modern alternative. FM remains standard for asset pricing tests because of its interpretability: the distribution of \(\hat{\gamma}_{1,t}\) values is informative in its own right.
Fama–MacBeth is everywhere in asset pricing and corporate finance:
If you read a finance paper that reports “Fama–MacBeth regressions,” you now know exactly what the authors did and why.
| Aspect | Key point |
|---|---|
| When to use | Testing whether firm characteristics systematically predict returns over time |
| Key insight | Two-step procedure that automatically handles cross-sectional correlation |
| Step 1 | Run a cross-sectional regression for each time period |
| Step 2 | Average the coefficients; use time-series variation for SE |
| Strengths | Interpretable, widely used, addresses cross-sectional correlation |
| Limitations | EIV bias, assumes stable relationship, balanced panel preferred |
| Alternative | Double-clustered SE (Petersen, 2009) for the same problem |
| Lecture | Method | Identifies | Data structure | Key assumption |
|---|---|---|---|---|
| L3 | OLS + controls | Cross-sectional relationships | Cross-section | No omitted variables (selection on observables) |
| L4 | DiD / Panel FE | Causal effect of treatment | Panel | Parallel trends |
| L5 | Event study | Market reaction to specific events | Daily panel | Market efficiency + correct factor model |
| L5 | Fama–MacBeth | Systematic return predictability | Repeated cross-sections | Cross-sections are independent draws |
Each method trades off different assumptions. The art of empirical finance is choosing the method whose assumptions are most credible for your specific research question and data.
The standard error story has been building across the course:
All three address the same fundamental issue: when residuals are not independent, naive standard errors are too small. The solution is always to match the inference method to the correlation structure of the data.
The common thread
The choice of standard error is not a technicality; it is a statement about what you believe is independent in your data. Getting it wrong can flip your conclusions from significant to insignificant (or vice versa).
Assignment 5 asks you to:
This is the full event study pipeline from raw data to publishable results. Everything from today’s lecture maps directly onto specific assignment tasks.
Both methods from today are staples of finance research:
For your replication project
Understanding the mechanics, assumptions, and limitations of the applied methods puts you in a strong position to critically evaluate and extend the original analysis.
In the final lecture, we bring everything together:
Key message from this course
The methods are tools, not ends in themselves. The value is in understanding when each tool is appropriate, what it assumes, and how to interpret the results in the context of a real research question.
Thank You for Your Attention!
See You in the Final Lecture!
MacKinlay (1997) established the modern event study framework:
This paper remains the default citation for event study methodology. Nearly every event study paper begins with a sentence like: “We follow the standard event study methodology of MacKinlay (1997).”
Fama and French (1993) documented that size and value explain a large share of cross-sectional return variation that CAPM beta alone cannot. Whether these reflect risk or mispricing remains debated — but the empirical regularity is well-established.
Petersen (2009) provides a decision framework:
| Correlation structure | Solution |
|---|---|
| Firm effect (residuals correlated within firm over time) | Cluster by firm |
| Time effect (residuals correlated across firms within period) | Cluster by time, or Fama–MacBeth |
| Both | Double-cluster by firm AND time |
Key finding: Fama–MacBeth and clustering by time produce similar standard errors when the number of time periods is large. Double-clustering is more general and handles both dimensions simultaneously.
Kolari and Pynnönen (2010) address a subtle issue: even in short-window event studies, abnormal returns may be cross-sectionally correlated when the event affects multiple firms simultaneously (as in Dieselgate).
Standard event study tests assume independence across firms. When this fails:
This matters for Dieselgate because all German automakers are hit by the same event on the same day — their ARs are not independent draws.