Data Analytics for Finance
BM17FI · Rotterdam School of Management
Difference-in-Differences
Setup¶
Run these cells to prepare your environment.
✅ The environment is cleared and ready.
/Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignment > s/04-assignment 📁 Base directory: /Users/casparm4/Github/rsm-data-analytics-in-finance-private > /private/assignments/04-assignment 📁 Raw data folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-privat > e/private/assignments/04-assignment/data/raw 📁 Processed data folder: /Users/casparm4/Github/rsm-data-analytics-in-finance- > private/private/assignments/04-assignment/data/processed 📁 Output directory: /Users/casparm4/Github/rsm-data-analytics-in-finance-priva > te/private/assignments/04-assignment/output 📁 Tables folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/ > private/assignments/04-assignment/output/tables 📁 Figures folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private > /private/assignments/04-assignment/output/figures
Learning Objectives¶
By completing this assignment, you will:
- Understand the DiD identification strategy and its assumptions
- Set up panel data for difference-in-differences analysis
- Estimate canonical DiD and two-way fixed effects models
- Visualize parallel trends to assess the key identifying assumption
- Apply clustered standard errors appropriate for panel data
- Critically evaluate DiD assumptions with small samples
Research Question¶
"Did German automakers experience differential changes in ESG risk scores compared to non-German automakers following the Dieselgate scandal?"
We use RepRisk Index (RRI) scores as our outcome variable. RepRisk is a leading ESG data provider that systematically monitors companies for environmental, social, and governance risks based on media, stakeholder, and third-party sources.
The Dieselgate scandal broke on September 18, 2015, when the U.S. Environmental Protection Agency (EPA) issued a Notice of Violation to Volkswagen for installing "defeat devices" in diesel vehicles to cheat emissions tests.
The key idea: If both groups would have followed parallel trends in the absence of treatment, then the difference in their differences (before vs. after) identifies the causal effect.
DiD Estimator:
τ = (Ytreat,post - Ytreat,pre) - (Ycontrol,post - Ycontrol,pre)
Key Assumption: Parallel trends — in the absence of treatment, the treatment and control groups would have evolved similarly over time.
- 0-25: Low risk exposure
- 26-49: Medium risk exposure (typical for large multinationals)
- 50-59: High risk exposure
- 60-74: Very high risk exposure
- 75-100: Extremely high risk exposure
Interpretation: An increase in RRI indicates that the company has received more negative media coverage and stakeholder criticism related to ESG issues.
- Load the panel dataset and set up the panel structure
- Clean the sample by dropping VW group subsidiaries
- Create the post-treatment indicator variable
- Calculate summary statistics by group and time period
- Visualize parallel trends
- Estimate the canonical DiD regression
- Add control variables
- Estimate two-way fixed effects models
- Apply clustered standard errors
- Create an event study plot
- Export a publication-quality regression table
- Critically evaluate the DiD assumptions
Section 1: Load and Examine Panel Structure¶
In this section, you'll load the panel dataset, clean the sample by removing VW group subsidiaries, and set up the panel structure for DiD analysis.
Task 1.1: Load the Dataset¶
Load the file auto_firms_panel_did.dta from the raw data folder.
use "$raw/filename.dta", clear to load a Stata dataset.
✅ Dataset loaded successfully
Task 1.2: Examine the Dataset¶
Use describe to examine the structure of the dataset. Then use tabulate to see the distribution of firms by country (german indicator).
Contains data from /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/04-assignment/data/raw/auto_firms_panel_did.dta
Observations: 3,840
Variables: 20 09 Jan 2026 12:18
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
gvkey str6 %-9s
RepRisk_ID double %10.0g
conm str28 %-9s
fic str3 %-9s
sic str4 %-9s
naics str6 %-9s
german long %12.0g
date double %td
year double %10.0g
month double %10.0g
current_RRI double %10.0g
at double %10.0g
roa double %10.0g
cash_ratio double %10.0g
log_sale double %10.0g
leverage double %10.0g
RepRisk_rating str3 %-9s
social_percen~e str4 %-9s
governance_pe~e str4 %-9s
environmental~e str4 %-9s
-------------------------------------------------------------------------------
Sorted by:
german | Freq. Percent Cum.
------------+-----------------------------------
0 | 3,480 90.62 90.62
1 | 360 9.38 100.00
------------+-----------------------------------
Total | 3,840 100.00
| german
conm | 0 1 | Total
----------------------+----------------------+----------
ANHUI ANKAI AUTOMOB.. | 60 0 | 60
ANHUI JIANGHUAI AUT.. | 60 0 | 60
ASHOK LEYLAND LTD | 60 0 | 60
ASTON MARTIN LAGOND.. | 60 0 | 60
AUDI AG | 0 60 | 60
AUTECH CORP | 60 0 | 60
AVIC SHENYANG AIRCR.. | 60 0 | 60
AVTOVAZ PJSC | 60 0 | 60
BAIC BLUEPARK NEW E.. | 60 0 | 60
BAIC MOTOR CORP LTD | 60 0 | 60
BAYERISCHE MOTOREN .. | 0 60 | 60
BRILLIANCE CHINA AU.. | 60 0 | 60
BYD COMPANY LTD | 60 0 | 60
CHONG QING CHANGAN .. | 60 0 | 60
CHONGQING DIMA INDU.. | 60 0 | 60
DONGFENG MOTOR GROU.. | 60 0 | 60
DRB HICOM BERHAD | 60 0 | 60
EICHER MOTORS | 60 0 | 60
FERRARI NV | 60 0 | 60
FORD OTOMOTIV SANAY.. | 60 0 | 60
GAZ PJSC | 60 0 | 60
GB CORP | 60 0 | 60
GEELY AUTOMOBILE HL.. | 60 0 | 60
GREAT WALL MOTOR CO | 60 0 | 60
GUANGZHOU AUTOMOBIL.. | 60 0 | 60
HINDUSTAN MOTORS LTD | 60 0 | 60
HINO MOTORS LTD | 60 0 | 60
HONDA ATLAS CARS LTD | 60 0 | 60
HONDA MOTOR CO LTD | 60 0 | 60
HYUNDAI MOTOR CO LTD | 60 0 | 60
INDUS MOTORS CO PLC | 60 0 | 60
ISUZU MOTORS LTD | 60 0 | 60
JIANG LING MOTORS C.. | 60 0 | 60
KAMAZ PTC | 60 0 | 60
KG MOBILITY CORP | 60 0 | 60
KIA CORPORATION | 60 0 | 60
LIKHACHOV PLANT PJSC | 60 0 | 60
MAHINDRA & MAHINDRA.. | 60 0 | 60
MAN SE | 0 60 | 60
MARUTI SUZUKI INDIA.. | 60 0 | 60
MAZDA MOTOR CORP | 60 0 | 60
MERCEDES BENZ GROUP.. | 0 60 | 60
MITSUBISHI MOTORS C.. | 60 0 | 60
MORITA HLDGS | 60 0 | 60
NISSAN MOTOR CO LTD | 60 0 | 60
ORIENTAL HOLDINGS BHD | 60 0 | 60
PORSCHE AUTOMOBIL H.. | 0 60 | 60
RENAULT SA | 60 0 | 60
ROSENBAUER INTERNAT.. | 60 0 | 60
SAIC MOTOR CORP LTD | 60 0 | 60
SCANIA AB | 60 0 | 60
SOLLERS PJSC | 60 0 | 60
STELLANTIS NV | 60 0 | 60
SUBARU CORP | 60 0 | 60
SUZUKI MOTOR CO LTD | 60 0 | 60
TATA MOTORS PASSENGER | 60 0 | 60
TOFAS-TURK OTOMOBIL.. | 60 0 | 60
TOYOTA MOTOR CORP | 60 0 | 60
UMW HOLDINGS BHD | 60 0 | 60
VOLKSWAGEN AG | 0 60 | 60
VOLVO AB | 60 0 | 60
WILLIAMS GRAND PRIX.. | 60 0 | 60
YANGZHOU YAXING COA.. | 60 0 | 60
YULON MOTOR CO LTD | 60 0 | 60
----------------------+----------------------+----------
Total | 3,480 360 | 3,840
Task 1.3: Drop VW Group Subsidiaries¶
- AUDI AG (gvkey: 101120) — VW subsidiary
- PORSCHE AUTOMOBIL HOLDING SE (gvkey: 102187) — VW Group holding company
- SCANIA AB (gvkey: 062557) — VW subsidiary (commercial vehicles)
- MAN SE (gvkey: 100042) — VW subsidiary (trucks/buses)
Why drop these firms?
- Avoid double-counting: Subsidiaries' outcomes are mechanically linked to VW's scandal
- Clean identification: We want to compare independent German automakers (VW, BMW, Mercedes-Benz) to non-German peers
- Economic interpretation: Holding companies and subsidiaries don't represent separate firms for our research question
Drop the four VW Group firms listed above using the drop if command with the gvkey variable.
drop if inlist(gvkey, "value1", "value2", ...) to remove observations matching any of the listed values.
(240 observations deleted)
✅ VW Group firms dropped
Remaining firms:
| german
conm | 0 1 | Total
----------------------+----------------------+----------
ANHUI ANKAI AUTOMOB.. | 60 0 | 60
ANHUI JIANGHUAI AUT.. | 60 0 | 60
ASHOK LEYLAND LTD | 60 0 | 60
ASTON MARTIN LAGOND.. | 60 0 | 60
AUTECH CORP | 60 0 | 60
AVIC SHENYANG AIRCR.. | 60 0 | 60
AVTOVAZ PJSC | 60 0 | 60
BAIC BLUEPARK NEW E.. | 60 0 | 60
BAIC MOTOR CORP LTD | 60 0 | 60
BAYERISCHE MOTOREN .. | 0 60 | 60
BRILLIANCE CHINA AU.. | 60 0 | 60
BYD COMPANY LTD | 60 0 | 60
CHONG QING CHANGAN .. | 60 0 | 60
CHONGQING DIMA INDU.. | 60 0 | 60
DONGFENG MOTOR GROU.. | 60 0 | 60
DRB HICOM BERHAD | 60 0 | 60
EICHER MOTORS | 60 0 | 60
FERRARI NV | 60 0 | 60
FORD OTOMOTIV SANAY.. | 60 0 | 60
GAZ PJSC | 60 0 | 60
GB CORP | 60 0 | 60
GEELY AUTOMOBILE HL.. | 60 0 | 60
GREAT WALL MOTOR CO | 60 0 | 60
GUANGZHOU AUTOMOBIL.. | 60 0 | 60
HINDUSTAN MOTORS LTD | 60 0 | 60
HINO MOTORS LTD | 60 0 | 60
HONDA ATLAS CARS LTD | 60 0 | 60
HONDA MOTOR CO LTD | 60 0 | 60
HYUNDAI MOTOR CO LTD | 60 0 | 60
INDUS MOTORS CO PLC | 60 0 | 60
ISUZU MOTORS LTD | 60 0 | 60
JIANG LING MOTORS C.. | 60 0 | 60
KAMAZ PTC | 60 0 | 60
KG MOBILITY CORP | 60 0 | 60
KIA CORPORATION | 60 0 | 60
LIKHACHOV PLANT PJSC | 60 0 | 60
MAHINDRA & MAHINDRA.. | 60 0 | 60
MARUTI SUZUKI INDIA.. | 60 0 | 60
MAZDA MOTOR CORP | 60 0 | 60
MERCEDES BENZ GROUP.. | 0 60 | 60
MITSUBISHI MOTORS C.. | 60 0 | 60
MORITA HLDGS | 60 0 | 60
NISSAN MOTOR CO LTD | 60 0 | 60
ORIENTAL HOLDINGS BHD | 60 0 | 60
RENAULT SA | 60 0 | 60
ROSENBAUER INTERNAT.. | 60 0 | 60
SAIC MOTOR CORP LTD | 60 0 | 60
SOLLERS PJSC | 60 0 | 60
STELLANTIS NV | 60 0 | 60
SUBARU CORP | 60 0 | 60
SUZUKI MOTOR CO LTD | 60 0 | 60
TATA MOTORS PASSENGER | 60 0 | 60
TOFAS-TURK OTOMOBIL.. | 60 0 | 60
TOYOTA MOTOR CORP | 60 0 | 60
UMW HOLDINGS BHD | 60 0 | 60
VOLKSWAGEN AG | 0 60 | 60
VOLVO AB | 60 0 | 60
WILLIAMS GRAND PRIX.. | 60 0 | 60
YANGZHOU YAXING COA.. | 60 0 | 60
YULON MOTOR CO LTD | 60 0 | 60
----------------------+----------------------+----------
Total | 3,420 180 | 3,600
Task 1.4: Create the Post-Treatment Indicator¶
Create a variable called post that equals 1 for observations from September 2015 onwards, and 0 otherwise.
The Dieselgate scandal broke on September 18, 2015. Since our data is monthly (end-of-month dates), we treat August 2015 as the last "pre-treatment" month, so post = 1 starting from September 2015.
gen post = (date >= mdy(9, 1, 2015)) to create a dummy variable that equals 1 when the condition is true.
The
mdy() function creates a Stata date from month, day, and year.
| Post-treatment (Sept
| 2015+)
year | 0 1 | Total
-----------+----------------------+----------
2013 | 720 0 | 720
2014 | 720 0 | 720
2015 | 480 240 | 720
2016 | 0 720 | 720
2017 | 0 720 | 720
-----------+----------------------+----------
Total | 1,920 1,680 | 3,600
✅ Post-treatment indicator created
Task 1.5: Set Up the Panel Structure¶
Declare the data as panel data using xtset. Because xtset requires a numeric panel identifier, you must first convert the string variable gvkey into a new numeric variable called gvkey_num using encode. Then declare the panel with gvkey_num as the firm identifier and date (monthly) as the time variable.
encode gvkey, gen(gvkey_num)— Convert the string firm ID to a numeric variable required byxtsetxtset gvkey_num date, monthly— Declare the panel structure (firm × month)xtdescribe— Examine panel balance: are all firms observed in all periods?
warning: Variable date had been formatted %td (a daily period), and you asked
for a monthly period. Are you sure that is what you want? Format has
been changed. date is now formatted %tm.
Panel variable: gvkey_num (strongly balanced)
Time variable: date, 3575m10 to 3725m5, but with gaps
Delta: 1 month
gvkey_num: 1, 2, ..., 60 n = 60
date: 3575m10, 3578m2, ..., 3725m5 T = 60
Delta(date) = 1 month
Span(date) = 1796 periods
(gvkey_num*date uniquely identifies each observation)
Distribution of T_i: min 5% 25% 50% 75% 95% max
60 60 60 60 60 60 60
Freq. Percent Cum. | Pattern*
---------------------------+--------------------------------------------------
> -----------------
60 100.00 100.00 | 111111111111.111111111111.11111111111.1111111111
> 11.11111111111.11
---------------------------+--------------------------------------------------
> -----------------
60 100.00 | XXXXXXXXXXXX.XXXXXXXXXXXX.XXXXXXXXXXX.XXXXXXXXXX
> XX.XXXXXXXXXXX.XX
------------------------------------------------------------------------------
> -----------------
*Each column represents 28 periods.
✅ Panel structure declared
Task 1.6: Visualize VW's RepRisk Index Over Time¶
Plot the current_RRI over time for Volkswagen (gvkey == "100737") to see the scandal's effect visually. Add a vertical reference line at the event date (September 18, 2015). Save the plot as vw_rri_timeseries.png in the figures folder.
twoway line varname timevar if condition— Line plot for a subset of observationsxline(`=mdy(9,18,2015)')— Vertical reference line at a specific dategraph export "$figures/vw_rri_timeseries.png", replace— Save the graph to the figures folder
file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assig > nments/04-assignment/output/figures/vw_rri_timeseries.png written in PNG form > at ✅ VW RepRisk plot saved
Do not delete the code cell below. They contain hidden tests that will check your work after you complete the tasks.
✅ Section 1 tests passed
Section 2: Summary Statistics by Group-Time¶
The essence of DiD is comparing changes over time between treatment and control groups. In this section, we'll create a 2×2 table of means and manually calculate the DiD estimator.
Task 2.1: Create 2×2 Table of Means¶
Calculate the mean current_RRI for each combination of german (treatment) and post (time period). This gives us the four cell means needed for the DiD calculation.
table rowvar colvar, stat(mean outcome) to create a table of means by group.
------------------------------------------------------------------
| Post-treatment (Sept 2015+)
| 0 1 Total
--------------------------------+---------------------------------
german |
0 |
Mean | 14.56 18.18 16.25
Standard deviation | 18.12 19.06 18.65
Number of nonmissing values | 1824.00 1596.00 3420.00
1 |
Mean | 52.70 61.12 56.63
Standard deviation | 11.15 9.01 11.02
Number of nonmissing values | 96.00 84.00 180.00
Total |
Mean | 16.47 20.33 18.27
Standard deviation | 19.67 20.90 20.34
Number of nonmissing values | 1920.00 1680.00 3600.00
------------------------------------------------------------------
Task 2.2: Calculate DiD "By Hand"¶
Calculate the difference-in-differences estimator manually using the four cell means:
$$\tau_{DiD} = (\bar{Y}_{treat,post} - \bar{Y}_{treat,pre}) - (\bar{Y}_{control,post} - \bar{Y}_{control,pre})$$
Use summarize four times (once per cell) to obtain each mean, storing each in a local macro. Then compute the DiD as the difference of differences. Finally, store the result in a global macro called did_manual — this makes it available to later sections for comparison with the regression estimate.
quietly summarize current_RRI if german == 1 & post == 1thenlocal treat_post = r(mean)— repeat for each of the four group-time cellslocal did = (`treat_post' - `treat_pre') - (`control_post' - `control_pre')— compute the DiDglobal did_manual = `did'— store as a global so Section 4 can compare it to the regression coefficient
==== Difference-in-Differences Calculation ==== German firms (treated): Post: 61.12 Pre: 52.70 Diff: 8.42 Non-German firms (control): Post: 18.18 Pre: 14.56 Diff: 3.62 DiD Estimate: 4.80 ============================================== ✅ DiD calculated manually
Do not delete the code cells below. They contain hidden tests that will check your work after you complete the tasks.
✅ Section 2 tests passed Manual DiD estimate: 4.80
Section 3: Parallel Trends Visualization¶
The key identifying assumption for DiD is parallel trends: in the absence of treatment, treated and control groups would have followed similar trajectories. While we cannot directly test this (it's about a counterfactual), we can visually inspect whether the groups moved together before the treatment.
Task 3.1: Create the Parallel Trends Plot¶
Collapse the data to the mean current_RRI for each german × date cell, then create a two-line time series plot showing the average RRI trajectory for German firms (treated) and non-German firms (control) over time. Use preserve and restore to avoid overwriting the firm-level panel data.
The plot should include:
- One line per group with a legend labeling them "German (treated)" and "Non-German (control)"
- A vertical reference line at the event date (September 18, 2015)
Save the plot as parallel_trends.png in the figures folder.
preserve/restore— Save and recover the original dataset around the collapsecollapse (mean) mean_rri = current_RRI, by(german date)— Reduce to group-time meanstwoway (line mean_rri date if german==1) (line mean_rri date if german==0)— Overlay two lineslegend(order(1 "German (treated)" 2 "Non-German (control)"))— Label the two linesxline(`=mdy(9,18,2015)')— Add a vertical line at the event date
(Note: Below code run with echo to enable preserve/restore functionality.)
. preserve
. collapse (mean) mean_rri = current_RRI, by(german date)
. twoway (line mean_rri date if german == 1, lcolor(purple) lwidth(medium)) (li
> ne mean_rri date if german == 0, lcolor(gray) lwidth(medium)), title("Paralle
> l Trends: RepRisk Index by Treatment Group", size(medium)) xtitle("Date") yti
> tle("Mean RepRisk Index (RRI)") xline(`=mdy(9,18,2015)', lcolor(red) lpattern
> (dash)) legend(order(1 "German (treated)" 2 "Non-German (control)") rows(1) p
> osition(6)) note("Red dashed line: Dieselgate announcement (Sept 18, 2015)")
> name(parallel_trends, replace)
. graph export "$figures/parallel_trends.png", replace width(1200)
(file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assi
> gnments/04-assignment/output/figures/parallel_trends.png not found)
file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assig
> nments/04-assignment/output/figures/parallel_trends.png written in PNG format
. display "✅ Parallel trends plot saved"
✅ Parallel trends plot saved
. restore
.
Do not delete the code cell below. They contain hidden tests that will check your work after you complete the tasks.
✅ Section 3 tests passed
Section 4: Canonical DiD Regression¶
Now let's estimate the canonical DiD regression. The model includes the treatment indicator (german), the post-treatment indicator (post), and their interaction (german × post). The coefficient on the interaction is the DiD estimate.
Yit = β₀ + β₁ × Germani + β₂ × Postt + β₃ × (Germani × Postt) + εit
Where:
- β₀: Mean outcome for control group in pre-period
- β₁: Baseline difference between treatment and control groups (pre-period)
- β₂: Time trend common to both groups
- β₃: The DiD estimate — differential change for treated group
Task 4.1: Create the Interaction Term¶
Create a new variable treated_post that equals german × post. This is the DiD interaction term.
gen treated_post = german * post to create the interaction term.
Means, Standard Deviations and Frequencies of German × Post (DiD)
| Post-treatment
| (Sept 2015+)
german | 0 1 | Total
-----------+----------------------+----------
0 | 0 0 | 0
| 0 0 | 0
| 1824 1596 | 3420
-----------+----------------------+----------
1 | 0 1 | .46666667
| 0 0 | .50027925
| 96 84 | 180
-----------+----------------------+----------
Total | 0 .05 | .02333333
| 0 .21800984 | .15098086
| 1920 1680 | 3600
✅ Interaction term created
Task 4.2: Estimate the Canonical DiD Regression¶
Estimate the regression: current_RRI = β₀ + β₁×german + β₂×post + β₃×treated_post + ε
Use robust standard errors by adding the , r option. Store the estimates as did1 for later comparison with richer specifications.
Questions to consider:
- What is the coefficient on
treated_post? Is it statistically significant? - Does it match your manual DiD calculation from Section 2?
- What does this coefficient mean economically (in terms of RepRisk Index points)?
regress outcome var1 var2 var3, r— OLS regression with robust standard errors (ris shorthand forrobust)estimates store did1— Save the regression results under the namedid1so they can be retrieved and tabulated later
Linear regression Number of obs = 3,600
F(3, 3596) = 900.76
Prob > F = 0.0000
R-squared = 0.1968
Root MSE = 18.238
------------------------------------------------------------------------------
| Robust
current_RRI | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
german | 38.13322 1.209446 31.53 0.000 35.76195 40.50449
post | 3.619518 .6385266 5.67 0.000 2.367607 4.871428
treated_post | 4.801613 1.626815 2.95 0.003 1.612041 7.991185
_cons | 14.56469 .4243336 34.32 0.000 13.73273 15.39665
------------------------------------------------------------------------------
Manual DiD estimate: 4.80
Regression DiD estimate: 4.80
✅ Canonical DiD regression estimated
Do not delete the code cells below. They contain hidden tests that will check your work after you complete the tasks.
✅ Section 4 tests passed DiD coefficient: 4.80
Section 5: Add Control Variables¶
Adding time-varying control variables can improve precision and address potential omitted variable bias. However, controls must be "good" controls — variables that affect the outcome but are not themselves affected by treatment.
Task 5.1: DiD Regression with Controls¶
Add the following control variables to the regression: log_sale (firm size), leverage, and roa (profitability). Keep the same three DiD variables (german, post, treated_post) and use robust standard errors.
Store the estimates as did2.
Question: Did the DiD coefficient change much after adding controls? What does this tell us?
regress current_RRI german post treated_post log_sale leverage roa, r
Then store the results with
estimates store did2.
Linear regression Number of obs = 3,600
F(6, 3593) = 872.81
Prob > F = 0.0000
R-squared = 0.4507
Root MSE = 15.089
------------------------------------------------------------------------------
| Robust
current_RRI | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
german | 35.78826 1.1932 29.99 0.000 33.44885 38.12768
post | 3.505634 .5317459 6.59 0.000 2.46308 4.548188
treated_post | 4.307353 1.510551 2.85 0.004 1.34573 7.268977
log_sale | 3.598123 .0934028 38.52 0.000 3.414995 3.781251
leverage | 2.400373 1.128868 2.13 0.034 .1870874 4.613659
roa | -33.82314 2.659785 -12.72 0.000 -39.03798 -28.6083
_cons | -24.07966 1.089107 -22.11 0.000 -26.21499 -21.94433
------------------------------------------------------------------------------
✅ DiD with controls estimated
Task 5.2: Compare Models¶
Compare the two DiD models did1 and did2 (with and without controls) using estimates table.
----------------------------------------------
Variable | did1 did2
-------------+--------------------------------
german | 38.133224*** 35.788262***
post | 3.6195175*** 3.5056342***
treated_post | 4.8016134** 4.3073531**
log_sale | 3.5981234***
leverage | 2.4003733*
roa | -33.823141***
_cons | 14.564693*** -24.079659***
-------------+--------------------------------
N | 3600 3600
r2 | .19678931 .45069059
----------------------------------------------
Legend: * p<0.05; ** p<0.01; *** p<0.001
Do not delete the code cells below. They contain hidden tests that will check your work after you complete the tasks.
✅ Section 5 tests passed DiD coefficient (with controls): 4.31
Section 6: Two-Way Fixed Effects¶
The Two-Way Fixed Effects (TWFE) model is a generalization of DiD that includes firm fixed effects and time fixed effects. This absorbs all time-invariant firm characteristics and all firm-invariant time trends.
Yit = αi + γt + β × (Germani × Postt) + εit
Where:
- αi: Firm fixed effects — absorb all time-invariant firm characteristics (including
german) - γt: Time fixed effects — absorb all firm-invariant time trends (including
post) - β: The DiD estimate — identified only from the interaction term
Key insight: With firm and time fixed effects, the
german and post main effects are absorbed (collinear with the fixed effects), so we only estimate the interaction term.
Task 6.1: Estimate TWFE Model¶
Use reghdfe to estimate the two-way fixed effects model. Absorb firm fixed effects (gvkey_num) and time fixed effects (year#month). Use robust standard errors.
Store the estimates as did3.
reghdfe command efficiently estimates models with high-dimensional fixed effects:
reghdfe current_RRI treated_post, absorb(gvkey_num year#month) vce(robust)
(MWFE estimator converged in 2 iterations)
HDFE Linear regression Number of obs = 3,600
Absorbing 2 HDFE groups F( 1, 3480) = 15.60
Prob > F = 0.0001
R-squared = 0.8258
Adj R-squared = 0.8199
Within R-sq. = 0.0038
Root MSE = 8.6335
------------------------------------------------------------------------------
| Robust
current_RRI | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
treated_post | 4.801613 1.215504 3.95 0.000 2.41844 7.184787
_cons | 18.16046 .1478644 122.82 0.000 17.87055 18.45037
------------------------------------------------------------------------------
Absorbed degrees of freedom:
------------------------------------------------------+
Absorbed FE | Categories - Redundant = Num. Coefs |
--------------+---------------------------------------|
gvkey_num | 60 0 60 |
year#month | 60 1 59 |
------------------------------------------------------+
✅ TWFE model estimated
Note: german and post are absorbed by firm and time fixed effects
Do not delete the code cells below. They contain hidden tests that will check your work after you complete the tasks.
✅ Section 6 tests passed TWFE DiD coefficient: 4.80
Section 7: Clustered Standard Errors¶
With panel data, observations within the same firm are likely correlated over time. Clustering standard errors at the firm level accounts for this serial correlation.
- Only 3 German firms (VW, BMW, Mercedes-Benz) in the treated group
- Around 57 control firms
- Total of ~60 clusters
With few treated clusters, clustered standard errors may be too small, leading to over-rejection. Keep this limitation in mind when interpreting results.
Task 7.1: TWFE with Clustered Standard Errors¶
Re-estimate the TWFE model — same specification as Task 6.1 (absorb(gvkey_num year#month)) — but replace vce(robust) with standard errors clustered at the firm level (vce(cluster gvkey_num)).
Store the estimates as did4.
reghdfe current_RRI treated_post, absorb(gvkey_num year#month) vce(cluster gvkey_num)
(MWFE estimator converged in 2 iterations)
HDFE Linear regression Number of obs = 3,600
Absorbing 2 HDFE groups F( 1, 59) = 3.79
Statistics robust to heteroskedasticity Prob > F = 0.0563
R-squared = 0.8258
Adj R-squared = 0.8199
Within R-sq. = 0.0038
Number of clusters (gvkey_num) = 60 Root MSE = 8.6335
(Std. err. adjusted for 60 clusters in gvkey_num)
------------------------------------------------------------------------------
| Robust
current_RRI | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
treated_post | 4.801613 2.466501 1.95 0.056 -.1338435 9.73707
_cons | 18.16046 .0575517 315.55 0.000 18.0453 18.27562
------------------------------------------------------------------------------
Absorbed degrees of freedom:
------------------------------------------------------+
Absorbed FE | Categories - Redundant = Num. Coefs |
--------------+---------------------------------------|
gvkey_num | 60 60 0 *|
year#month | 60 1 59 |
------------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation
✅ TWFE with clustered SEs estimated
Do not delete the code cells below. They contain hidden tests that will check your work after you complete the tasks.
✅ Section 7 tests passed
Section 8: Event Study Specification¶
An event study specification allows us to examine the dynamics of the treatment effect over time. Instead of a single post indicator, we interact treatment with time-relative-to-event dummies. This helps visualize whether:
- The parallel trends assumption holds in the pre-period (coefficients should be near zero)
- The treatment effect is immediate or gradual
Task 8.1: Create Relative Time Variable¶
Create a variable rel_time that measures the number of months relative to the event (September 2015 = 0, the first post-treatment month). Negative values are pre-treatment, positive values are post-treatment.
mofd(date) to convert a date to months since Jan 1960, then subtract the event month.
Months |
relative to |
event (Sept |
2015 = 0) | Freq. Percent Cum.
------------+-----------------------------------
-32 | 60 1.67 1.67
-31 | 60 1.67 3.33
-30 | 60 1.67 5.00
-29 | 60 1.67 6.67
-28 | 60 1.67 8.33
-27 | 60 1.67 10.00
-26 | 60 1.67 11.67
-25 | 60 1.67 13.33
-24 | 60 1.67 15.00
-23 | 60 1.67 16.67
-22 | 60 1.67 18.33
-21 | 60 1.67 20.00
-20 | 60 1.67 21.67
-19 | 60 1.67 23.33
-18 | 60 1.67 25.00
-17 | 60 1.67 26.67
-16 | 60 1.67 28.33
-15 | 60 1.67 30.00
-14 | 60 1.67 31.67
-13 | 60 1.67 33.33
-12 | 60 1.67 35.00
-11 | 60 1.67 36.67
-10 | 60 1.67 38.33
-9 | 60 1.67 40.00
-8 | 60 1.67 41.67
-7 | 60 1.67 43.33
-6 | 60 1.67 45.00
-5 | 60 1.67 46.67
-4 | 60 1.67 48.33
-3 | 60 1.67 50.00
-2 | 60 1.67 51.67
-1 | 60 1.67 53.33
0 | 60 1.67 55.00
1 | 60 1.67 56.67
2 | 60 1.67 58.33
3 | 60 1.67 60.00
4 | 60 1.67 61.67
5 | 60 1.67 63.33
6 | 60 1.67 65.00
7 | 60 1.67 66.67
8 | 60 1.67 68.33
9 | 60 1.67 70.00
10 | 60 1.67 71.67
11 | 60 1.67 73.33
12 | 60 1.67 75.00
13 | 60 1.67 76.67
14 | 60 1.67 78.33
15 | 60 1.67 80.00
16 | 60 1.67 81.67
17 | 60 1.67 83.33
18 | 60 1.67 85.00
19 | 60 1.67 86.67
20 | 60 1.67 88.33
21 | 60 1.67 90.00
22 | 60 1.67 91.67
23 | 60 1.67 93.33
24 | 60 1.67 95.00
25 | 60 1.67 96.67
26 | 60 1.67 98.33
27 | 60 1.67 100.00
------------+-----------------------------------
Total | 3,600 100.00
✅ Relative time variable created
Task 8.2: Estimate Event Study Specification¶
Estimate an event study regression that interacts the german indicator with relative-time dummies. Because a window of ±36 months produces many sparse endpoint categories, first bin rel_time at ±12 months (so observations outside the window are treated as if they were exactly at the boundary). This gives rel_time_binned.
Stata's factor-variable notation (i.) requires non-negative integers, but rel_time_binned runs from −12 to +12. Shift it by 13 to create rel_time_shifted (where the reference period rel_time = −1 maps to rel_time_shifted = 12).
Use rel_time = −1 (August 2015, the last pre-treatment month) as the reference period, and cluster standard errors at the firm level.
- Bin:
gen rel_time_binned = rel_time, thenreplace rel_time_binned = -12 if rel_time < -12(and= 12for the upper end) - Shift:
gen rel_time_shifted = rel_time_binned + 13— maps rel_time −1 → shifted 12 (the reference) - Regression:
reghdfe current_RRI c.german#ib(12).rel_time_shifted, absorb(gvkey_num date) vce(cluster gvkey_num)
(1,200 real changes made)
(900 real changes made)
(MWFE estimator converged in 2 iterations)
note: 25.rel_time_shifted#c.german omitted because of collinearity
HDFE Linear regression Number of obs = 3,600
Absorbing 2 HDFE groups F( 24, 59) = 171.82
Statistics robust to heteroskedasticity Prob > F = 0.0000
R-squared = 0.8272
Adj R-squared = 0.8201
Within R-sq. = 0.0116
Number of clusters (gvkey_num) = 60 Root MSE = 8.6283
(Std. err. adjusted for 60 clusters in gvkey_num)
------------------------------------------------------------------------------
| Robust
current_RRI | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
rel_time_s~d#|
c.german |
1 | -5.155388 1.474523 -3.50 0.001 -8.105902 -2.204875
2 | 2.77193 1.981815 1.40 0.167 -1.193673 6.737532
3 | 1.438596 2.310226 0.62 0.536 -3.184155 6.061348
4 | 4.45614 2.945462 1.51 0.136 -1.437716 10.35
5 | -.4385965 1.41894 -0.31 0.758 -3.277889 2.400696
6 | -7.403509 3.497139 -2.12 0.038 -14.40127 -.4057492
7 | -2.701754 3.29291 -0.82 0.415 -9.290853 3.887344
8 | 1.175439 1.57676 0.75 0.459 -1.97965 4.330527
9 | -3.912281 3.766425 -1.04 0.303 -11.44888 3.624318
10 | -13.26316 5.086301 -2.61 0.012 -23.44082 -3.085493
11 | -12.19298 5.476883 -2.23 0.030 -23.1522 -1.233765
12 | -5.052632 1.735336 -2.91 0.005 -8.525032 -1.580232
13 | 7.912281 2.3308 3.39 0.001 3.24836 12.5762
14 | 6.035088 3.135369 1.92 0.059 -.2387705 12.30895
15 | 7.298246 2.743347 2.66 0.010 1.808821 12.78767
16 | 6.719298 3.066548 2.19 0.032 .5831491 12.85545
17 | 3.54386 3.496853 1.01 0.315 -3.453327 10.54105
18 | -.0175439 4.239059 -0.00 0.997 -8.499881 8.464794
19 | .754386 5.264943 0.14 0.887 -9.78074 11.28951
20 | -2.122807 5.360729 -0.40 0.694 -12.8496 8.603987
21 | -7.842105 8.306333 -0.94 0.349 -24.46304 8.778829
22 | -6.052632 5.429728 -1.11 0.269 -16.91749 4.81223
23 | -3.842105 2.959429 -1.30 0.199 -9.763908 2.079698
24 | -3.403509 4.546532 -0.75 0.457 -12.5011 5.694081
25 | 0 (omitted)
|
_cons | 18.3845 .0352392 521.71 0.000 18.31399 18.45502
------------------------------------------------------------------------------
Absorbed degrees of freedom:
-----------------------------------------------------+
Absorbed FE | Categories - Redundant = Num. Coefs |
-------------+---------------------------------------|
gvkey_num | 60 60 0 *|
date | 60 1 59 |
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation
Task 8.3: Plot the Event Study Coefficients¶
Extract the interaction coefficients and standard errors from the regression above and plot them as a connected scatter plot with 95% confidence interval bars. Save the plot as event_study_plot.png in the figures folder.
The plot should include:
- A horizontal zero line (no effect baseline)
- A vertical reference line between
t = −1andt = 0to mark the start of the post-treatment period
preserve and create variables to hold the plot data:
- Loop over
forvalues i = 1/25, assigning_b[c.german#`i'.rel_time_shifted]and the corresponding_se[]intocoefandsevariables - Set the reference period manually:
replace coef = 0 if rel_time_shifted == 12 - Build 95% CIs:
gen ci_top = coef + 1.96*seandci_bottom - Collapse to one row per period with
duplicates drop, then plot withtwoway (scatter coef rel_time_binned, connect(line)) (rcap ci_top ci_bottom rel_time_binned)
restore to recover the full dataset.
(Note: Below code run with echo to enable preserve/restore functionality.)
. preserve
. g coef = .
(3,600 missing values generated)
. g se = .
(3,600 missing values generated)
. forvalues i = 1/25 {
2. if `i' != 12 {
3. replace coef = _b[c.german#`i'.rel_time_shifted] if rel_time_shift
> ed == `i'
4. replace se = _se[c.german#`i'.rel_time_shifted] if rel_time_shifte
> d == `i'
5. }
6. }
(1,260 real changes made)
(1,260 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(60 real changes made)
(960 real changes made)
(960 real changes made)
. replace coef = 0 if rel_time_shifted == 12
(60 real changes made)
. replace se = 0 if rel_time_shifted == 12
(60 real changes made)
. g ci_top = coef + 1.96*se
. g ci_bottom = coef - 1.96*se
. keep rel_time_shifted rel_time_binned coef se ci_*
. duplicates drop
Duplicates in terms of all variables
(3,575 observations deleted)
. twoway (scatter coef rel_time_binned, connect(line) mcolor(navy) lcolor(navy)
> ) (rcap ci_top ci_bottom rel_time_binned, lcolor(navy%50)) (function y = 0, r
> ange(-12 12) lcolor(red) lpattern(dash)), xline(-0.5, lcolor(gray) lpattern(d
> ash)) xtitle("Months Relative to Event (Sep 2015 = 0)") ytitle("Coefficient (
> DiD Effect)") title("Event Study: Effect of Dieselgate on German Automakers'
> RRI") legend(off) caption("Reference period: t = -1 (August 2015). 95% CIs sh
> own." "Dashed vertical line marks first post-treatment month.") name(event_st
> udy, replace)
. graph export "$figures/event_study_plot.png", replace width(1200)
(file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assi
> gnments/04-assignment/output/figures/event_study_plot.png not found)
file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assig
> nments/04-assignment/output/figures/event_study_plot.png written in PNG forma
> t
. restore
. display "✅ Event study plot saved"
✅ Event study plot saved
.
Do not delete the code cells below. They contain hidden tests that will check your work after you complete the tasks.
✅ Section 8 tests passed
Section 9: Regression Table¶
Create a publication-quality regression table showing the progression of specifications.
Task 9.1: Export Regression Table¶
Use esttab to export all four DiD models (did1, did2, did3, did4) to a LaTeX file called did_regression_table.tex in the tables folder. The table should show the progression from canonical DiD to TWFE with clustered standard errors.
se— Show standard errorsstar(* 0.10 ** 0.05 *** 0.01)— Significance starsstats(N r2)— Include N and R²mtitles()— Column titles
(file
/Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assi
> gnments/04-assignment/output/tables/did_regression_table.tex not found)
(output written to /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/04-assignment/output/tables/did_regression_table.tex)
✅ Regression table exported: did_regression_table.tex
Do not delete the code cell below. It contains hidden tests that will check your work after you complete the tasks.
✅ Section 9 tests passed
Section 10: Critical Evaluation (Ungraded)¶
No research design is perfect. In this section, reflect on the limitations of our DiD analysis and what would strengthen our causal claims.
Task 10.1: Discuss Limitations¶
In the markdown cell below, discuss at least three limitations of this DiD analysis. Consider:
Parallel trends assumption: Can we confidently say that German and non-German automakers would have followed similar RRI trends absent Dieselgate?
Small number of treated units: We have only 3 German firms in the treated group. What are the implications for inference?
Other threats to identification: Are there other events around September 2015 that might affect German automakers differently? (e.g., refugee crisis, other policy changes)
Generalizability: Can we generalize findings from the auto industry to other contexts?
Measurement: Is RepRisk a good measure of the "true" ESG risk impact of Dieselgate?
LIMITATIONS:
Small number of treated units: With only 3 German automakers (Volkswagen, BMW, Mercedes-Benz) in the treated group, our statistical power is limited. Standard asymptotic inference (including clustered standard errors) may be unreliable with so few clusters. A single outlier firm could substantially influence the results. More robust methods like randomization inference or wild cluster bootstrap might be appropriate.
Parallel trends assumption: While the parallel trends plot suggests reasonably similar pre-trends, we cannot formally test the assumption (it concerns a counterfactual). The pre-period shows some divergence between groups, and the control group (non-German automakers) is quite heterogeneous — including Japanese, French, Italian, and Korean firms that may face different industry dynamics.
Potential confounders: Other events occurred around September 2015 that might differentially affect German firms:
- The European refugee crisis peaked in late 2015, with Germany taking a prominent role
- ECB monetary policy changes
- Other regulatory developments in the automotive sector These confounders could bias our estimate if they affected German automakers' ESG profiles differently from control firms.
Spillover effects: Our identification assumes no spillover from treated to control firms. However, if Dieselgate increased scrutiny of all automakers' emissions practices, control firms might also have experienced RRI changes — biasing our DiD estimate toward zero.
Section 11: Extension — Spillover Analysis (Ungraded)¶
This analysis helps distinguish between direct effects (VW was caught cheating) and contagion/spillover effects (guilt by national association).
Extension 11.1: Create Treatment Subgroups¶
Create two new treatment indicators:
vw: 1 for Volkswagen only (gvkey = "100737")other_german: 1 for other German firms (BMW, Mercedes-Benz)
| Volkswagen
conm | 0 1 | Total
----------------------+----------------------+----------
BAYERISCHE MOTOREN .. | 60 0 | 60
MERCEDES BENZ GROUP.. | 60 0 | 60
VOLKSWAGEN AG | 0 60 | 60
----------------------+----------------------+----------
Total | 120 60 | 180
| Other German (BMW,
| Mercedes)
conm | 0 1 | Total
----------------------+----------------------+----------
BAYERISCHE MOTOREN .. | 0 60 | 60
MERCEDES BENZ GROUP.. | 0 60 | 60
VOLKSWAGEN AG | 60 0 | 60
----------------------+----------------------+----------
Total | 60 120 | 180
✅ Treatment subgroups created
Extension 11.2: Separate DiD for VW vs Other German Firms¶
Run two separate DiD regressions:
- VW vs. non-German control firms
- Other German firms vs. non-German control firms
Compare the magnitudes. Is the effect larger for VW (the scandal firm)?
(MWFE estimator converged in 2 iterations)
HDFE Linear regression Number of obs = 3,480
Absorbing 2 HDFE groups F( 1, 57) = 99.19
Statistics robust to heteroskedasticity Prob > F = 0.0000
R-squared = 0.8100
Adj R-squared = 0.8034
Within R-sq. = 0.0056
Number of clusters (gvkey_num) = 58 Root MSE = 8.6291
(Std. err. adjusted for 58 clusters in gvkey_num)
------------------------------------------------------------------------------
| Robust
current_RRI | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
vw_post | 9.817982 .9857921 9.96 0.000 7.843968 11.792
_cons | 16.96066 .0079317 2138.35 0.000 16.94478 16.97654
------------------------------------------------------------------------------
Absorbed degrees of freedom:
-----------------------------------------------------+
Absorbed FE | Categories - Redundant = Num. Coefs |
-------------+---------------------------------------|
gvkey_num | 58 58 0 *|
date | 60 1 59 |
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation
(MWFE estimator converged in 2 iterations)
HDFE Linear regression Number of obs = 3,540
Absorbing 2 HDFE groups F( 1, 58) = 1.94
Statistics robust to heteroskedasticity Prob > F = 0.1695
R-squared = 0.8125
Adj R-squared = 0.8060
Within R-sq. = 0.0006
Number of clusters (gvkey_num) = 59 Root MSE = 8.6570
(Std. err. adjusted for 59 clusters in gvkey_num)
------------------------------------------------------------------------------
| Robust
current_RRI | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
other_germ~t | 2.293429 1.648419 1.39 0.169 -1.006241 5.593099
_cons | 17.4979 .0260767 671.02 0.000 17.4457 17.5501
------------------------------------------------------------------------------
Absorbed degrees of freedom:
-----------------------------------------------------+
Absorbed FE | Categories - Redundant = Num. Coefs |
-------------+---------------------------------------|
gvkey_num | 59 59 0 *|
date | 60 1 59 |
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation
==== Spillover Analysis Results ====
----------------------------------------------
Variable | vw_only other_germa~y
-------------+--------------------------------
vw_post | 9.8179825***
other_germ~t | 2.2934289
_cons | 16.96066*** 17.497901***
-------------+--------------------------------
N | 3480 3540
r2 | .809981 .81249319
----------------------------------------------
Legend: * p<0.05; ** p<0.01; *** p<0.001
Interpretation:
- VW coefficient: Direct effect on scandal firm
- Other German coefficient: Spillover/contagion effect on peers
Extension 11.3: Interpretation¶
If VW shows a larger effect than other German firms, this suggests the reputational damage was concentrated at the scandal firm. If both show similar effects, it suggests broader contagion to the "German automotive" brand.
References¶
- The Effect (Free Online Textbook): Huntington-Klein, N. (2021). The Effect: An Introduction to Research Design and Causality
- RepRisk Methodology: RepRisk Research Methodology
- Dieselgate Timeline: EPA Notice of Violation (September 18, 2015) - EPA Press Release
Data Analytics for Finance
BM17FI · Academic Year 2025–26
Created by: Caspar David Peter
© 2026 Rotterdam School of Management