Data Analytics for Finance
BM17FI · Rotterdam School of Management
Regression Analysis
Setup¶
Run these cells to prepare your environment.
✅ The environment is cleared and ready.
/Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/03-assignment
📁 Base directory: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/03-assignment
📁 Raw data folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/03-assignment/data/raw
📁 Output directory: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/03-assignment/output
📁 Tables folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/03-assignment/output/tables
📁 Figures folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/03-assignment/output/figures
Learning Objectives¶
By completing this assignment, you will:
- Estimate OLS regressions using the regress command
- Interpret coefficients economically and statistically
- Test OLS assumptions (heteroskedasticity, normality of residuals)
- Apply robust standard errors when heteroskedasticity is detected
- Create publication-quality regression tables using esttab
Research Question¶
"What firm characteristics explain the cross-sectional variation in stock price reactions to the Dieselgate scandal?"
Specifically, we want to know:
- Did German automakers experience more negative abnormal returns than other automotive firms?
- Can this effect be explained by firm characteristics such as size, leverage, and profitability?
Background: Event Study and Abnormal Returns¶
An abnormal return (AR) is the difference between a stock's actual return and its expected return based on market movements:
$$AR_t = R_t - E[R_t]$$
Where:
- $R_t$ = Actual return on day $t$
- $E[R_t]$ = Expected return from the market model ($\alpha + \beta \times$ Market Return)
The Cumulative Abnormal Return (CAR) sums abnormal returns over an event window:
$$CAR[0,5] = AR_0 + AR_1 + AR_2 + AR_3 + AR_4 + AR_5$$
Interpretation: A CAR of -0.15 means the stock underperformed its expected return by 15 percentage points during the event window.
In this assignment, CARs have been pre-calculated using the market model with Fama-French Europe factors. See the codebook for methodology details. We will come back to this in more detail in assignment 5.
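To make the AR and CAR definitions concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration only: the toy panel, the variable names firm_id, event_day, ret, and mkt_ret, and the estimation window of event days -250 to -11. It is not the procedure used to build the assignment's CARs, which is described in the codebook.

```stata
* Simulated toy panel (NOT the assignment data): 3 firms, event days -260..10
clear
set seed 12345
set obs 3
gen firm_id = _n
expand 271
bysort firm_id: gen event_day = _n - 261
gen double mkt_ret = rnormal(0, 0.01)
gen double ret = 0.0002 + 1.1 * mkt_ret + rnormal(0, 0.02)

* Market model per firm on the (assumed) estimation window,
* then AR_t = R_t - E[R_t] for every day of that firm
gen double ar = .
forvalues f = 1/3 {
    quietly regress ret mkt_ret if firm_id == `f' & inrange(event_day, -250, -11)
    quietly predict double exp_ret if firm_id == `f', xb
    quietly replace ar = ret - exp_ret if firm_id == `f'
    drop exp_ret
}

* CAR[0,5]: sum of abnormal returns over event days 0 to 5
egen double car_0_5 = total(ar * inrange(event_day, 0, 5)), by(firm_id)
bysort firm_id: keep if _n == 1
list firm_id car_0_5
```

The result is one row per firm, which is the cross-sectional shape of the dataset used in the rest of this assignment.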
The assignment proceeds in the following steps:
- Load and examine the cross-sectional dataset
- Perform exploratory analysis (correlations and scatter plots)
- Estimate a baseline OLS regression
- Add control variables and compare models
- Test for heteroskedasticity (Breusch-Pagan test)
- Test for normality of residuals (Shapiro-Wilk test)
- Re-estimate with robust standard errors
- Export a publication-quality regression table
- Interpret your findings
Section 1: Load and Examine Data¶
In this section, you'll load the cross-sectional dataset and examine its structure.
Task 1.1: Load the Dataset¶
Load the file auto_firms_event_crosssection.dta from the raw data folder.
Hint: the command use "$raw/filename.dta", clear loads a Stata dataset.
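A possible version of this cell is sketched below. It assumes the Setup cells have already defined the global macro $raw for the raw data folder. The display line is an optional sanity check and not part of the task.

```stata
* Assumes $raw was defined by the Setup cells
use "$raw/auto_firms_event_crosssection.dta", clear

* Optional check: _N is the number of observations, c(k) the number of variables
display "Loaded " _N " observations and " c(k) " variables"
```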
✅ Dataset loaded successfully
✅ Dataset loaded correctly
Task 1.2: Examine the Dataset¶
Use describe to examine the structure of the dataset. How many observations? How many variables?
Contains data from /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/03-assignment/data/raw/auto_firms_event_crosssection.dta
Observations: 84
Variables: 24 06 Jan 2026 15:08
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
gvkey str6 %-9s
conm str28 %-9s
fic str3 %-9s
german long %12.0g
car_0_5 double %10.0g
car_0_10 double %10.0g
log_at double %10.0g
leverage double %10.0g
roa double %10.0g
debt_equity double %10.0g
roe double %10.0g
margin double %10.0g
cash_ratio double %10.0g
at double %10.0g
sale double %10.0g
dltt double %10.0g
dlc double %10.0g
ceq double %10.0g
ebitda double %10.0g
ib double %10.0g
oibdp double %10.0g
che double %10.0g
n_days_0_5 long %12.0g
n_days_0_10 long %12.0g
-------------------------------------------------------------------------------
Sorted by:
Task 1.3: Summary Statistics¶
Generate summary statistics for the key variables: car_0_5, car_0_10, german, log_at, leverage, and roa.
Hint: summarize varlist shows the mean, SD, min, and max for the specified variables.
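One way to write this cell, assuming the dataset from Task 1.1 is in memory. The second part is an optional extra: because german is a 0/1 indicator, its sum gives the number of German firms.

```stata
* Summary statistics for the key variables
summarize car_0_5 car_0_10 german log_at leverage roa

* Optional: count German firms via the sum of the 0/1 indicator
quietly summarize german
display "German firms: " r(sum) " of " r(N)
```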
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
car_0_5 | 84 -.0074558 .0727016 -.3523587 .2633667
car_0_10 | 84 .0072848 .1414629 -.8872483 .2801222
german | 84 .0833333 .2780454 0 1
log_at | 84 10.36085 3.473371 2.858938 18.80747
leverage | 84 .2161082 .1966058 0 .9964587
-------------+---------------------------------------------------------
roa | 84 .0615514 .1036582 -.4132906 .3438476
✅ Summary statistics computed correctly
Section 2: Exploratory Analysis¶
Before running regressions, let's explore the relationships between variables using correlations and scatter plots.
Task 2.1: Correlation Matrix¶
Calculate the correlation matrix for car_0_5, german, log_at, leverage, and roa.
Question: Is car_0_5 negatively correlated with german? What does this suggest?
Hint: correlate varlist displays a correlation matrix.
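A sketch of the correlation cell, assuming the dataset is in memory. The pwcorr line is an optional addition that is not required by the task: it reports p-values and stars the correlations significant at the 5% level, which helps with the question above.

```stata
* Correlation matrix requested in the task
correlate car_0_5 german log_at leverage roa

* Optional: pairwise correlations with p-values and a 5% significance star
pwcorr car_0_5 german log_at leverage roa, sig star(0.05)
```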
(obs=84)
| car_0_5 german log_at leverage roa
-------------+---------------------------------------------
car_0_5 | 1.0000
german | -0.4954 1.0000
log_at | -0.2459 -0.0106 1.0000
leverage | 0.0084 0.0700 0.0684 1.0000
roa | -0.0652 0.0153 0.3246 -0.3275 1.0000
✅ Correlation matrix computed correctly
Task 2.2: Scatter Plot - CAR vs German¶
Create a scatter plot of car_0_5 (y-axis) against german (x-axis). Add proper axis labels and a title. Save it with graph export "$figures/scatter_car_german.png", replace width(1200).
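The plot could be produced along these lines, assuming the dataset is in memory and $figures was defined by the Setup cells. The title and axis wording are illustrative choices, not prescribed by the assignment.

```stata
* Scatter of CAR against the 0/1 German indicator (labels are illustrative)
twoway scatter car_0_5 german, ///
    title("CAR[0,5] by German indicator") ///
    xtitle("German automaker (1 = yes)") ///
    ytitle("Cumulative abnormal return, days 0-5") ///
    xlabel(0 1)

* Save the figure to the figures folder
graph export "$figures/scatter_car_german.png", replace width(1200)
```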
file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/03-assignment/output/figures/scatter_car_german.png written in PNG format
✅ Scatter plot saved: scatter_car_german.png
✅ Scatter plot (CAR vs German) saved correctly
Task 2.3: Scatter Plot - CAR vs Size¶
Create a scatter plot of car_0_5 (y-axis) against log_at (x-axis). Save it with graph export "$figures/scatter_car_size.png".
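A sketch for this plot, under the same assumptions as before (dataset in memory, $figures defined). The overlaid linear fit and the replace width(1200) options are optional additions that the task does not ask for, and the labels are illustrative.

```stata
* Scatter of CAR against firm size, with an optional linear fit overlaid
twoway (scatter car_0_5 log_at) (lfit car_0_5 log_at), ///
    title("CAR[0,5] vs firm size") ///
    xtitle("Log(Total Assets)") ///
    ytitle("Cumulative abnormal return, days 0-5") ///
    legend(order(1 "Firms" 2 "Linear fit"))

graph export "$figures/scatter_car_size.png", replace width(1200)
```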
file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/03-assignment/output/figures/scatter_car_size.png written in PNG format
✅ Scatter plot saved: scatter_car_size.png
✅ Scatter plot (CAR vs Size) saved correctly
Checkpoint: Exploratory Analysis Complete¶
✅ Exploratory analysis plots created successfully
Section 3: Baseline OLS Regression¶
Now let's estimate a simple regression to test whether German firms experienced more negative CARs.
Task 3.1: Estimate Baseline Model¶
Estimate the regression: car_0_5 = β₀ + β₁×german + ε
Store the estimates with name m1 for later comparison.
Questions to consider:
- What is the coefficient on german? Is it statistically significant?
- What does this coefficient mean economically?
- What is the R-squared? What does it tell us?
Hints:
- regress depvar indepvars: OLS regression
- estimates store name: save estimates for later use
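A minimal sketch of the baseline cell, assuming the dataset is in memory. With a single 0/1 regressor, the constant is the mean CAR of non-German firms and the coefficient on german is the difference in mean CAR between German and other firms. The final line is an optional check that the reported t-statistic is the coefficient divided by its standard error.

```stata
* Baseline model: CAR on the German indicator
regress car_0_5 german
estimates store m1

* Optional check: t = coefficient / standard error
display "t = " %6.2f (_b[german] / _se[german])
```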
Source | SS df MS Number of obs = 84
-------------+---------------------------------- F(1, 82) = 26.68
Model | .107686651 1 .107686651 Prob > F = 0.0000
Residual | .331011944 82 .004036731 R-squared = 0.2455
-------------+---------------------------------- Adj R-squared = 0.2363
Total | .438698595 83 .005285525 Root MSE = .06354
------------------------------------------------------------------------------
car_0_5 | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
german | -.1295467 .0250819 -5.16 0.000 -.1794425 -.0796508
_cons | .0033398 .0072405 0.46 0.646 -.0110639 .0177435
------------------------------------------------------------------------------
✅ Baseline model (m1) estimated and stored
✅ Baseline model (m1) estimated correctly
Section 4: Add Control Variables¶
The baseline model might suffer from omitted variable bias. Let's add control variables for firm characteristics.
For example:
- Size (log_at): Larger firms might have more diversified operations and be less affected by single-country scandals
- Leverage: Highly leveraged firms might be more vulnerable to negative shocks
- Profitability (roa): More profitable firms might weather bad news better
By including these controls, we can isolate the effect of being German, holding constant these other firm characteristics.
Task 4.1: Regression with Controls¶
Estimate the regression: car_0_5 = β₀ + β₁×german + β₂×log_at + β₃×leverage + β₄×roa + ε
Store the estimates with name m2.
Questions to consider:
- Did the coefficient on german change compared to the baseline model?
- What does this tell us about omitted variable bias?
- Did the R-squared improve? By how much?
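The cell for this task might look as follows, assuming the dataset is in memory. The scalar r2_m1 is a hypothetical helper name used only to report the change in R-squared relative to the baseline model.

```stata
* Keep the baseline R-squared for comparison (r2_m1 is a helper scalar)
quietly regress car_0_5 german
scalar r2_m1 = e(r2)

* Model with controls
regress car_0_5 german log_at leverage roa
estimates store m2

display "Change in R-squared: " %6.4f (e(r2) - r2_m1)
```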
Source | SS df MS Number of obs = 84
-------------+---------------------------------- F(4, 79) = 9.08
Model | .138161095 4 .034540274 Prob > F = 0.0000
Residual | .3005375 79 .003804272 R-squared = 0.3149
-------------+---------------------------------- Adj R-squared = 0.2802
Total | .438698595 83 .005285525 Root MSE = .06168
------------------------------------------------------------------------------
car_0_5 | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
german | -.13205 .0244421 -5.40 0.000 -.1807007 -.0833992
log_at | -.0057776 .0021024 -2.75 0.007 -.0099622 -.0015929
leverage | .0303046 .0372949 0.81 0.419 -.0439291 .1045383
roa | .0413695 .0744365 0.56 0.580 -.1067927 .1895316
_cons | .0543134 .0222178 2.44 0.017 .0100899 .0985369
------------------------------------------------------------------------------
✅ Model with controls (m2) estimated and stored
✅ Model with controls (m2) estimated correctly
Task 4.2: Compare Models¶
Use estimates table to display both models side-by-side for easy comparison.
Hint: estimates table m1 m2, star stats(N r2) compares stored models, with significance stars and fit statistics.
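A sketch of the comparison, re-estimating both models quietly so the block runs on its own (it assumes only that the dataset is in memory). The second table is an optional variant showing standard errors. In estimates table, the star option cannot be combined with se, so the two views are requested separately.

```stata
* Re-estimate and store both models
quietly regress car_0_5 german
estimates store m1
quietly regress car_0_5 german log_at leverage roa
estimates store m2

* Side-by-side comparison with significance stars
estimates table m1 m2, star stats(N r2)

* Optional variant: coefficients with standard errors instead of stars
estimates table m1 m2, b(%9.4f) se(%9.4f) stats(N r2)
```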
----------------------------------------------
Variable | m1 m2
-------------+--------------------------------
german | -.12954665*** -.13204996***
log_at | -.00577757**
leverage | .03030462
roa | .04136945
_cons | .00333977 .05431344*
-------------+--------------------------------
N | 84 84
r2 | .24546842 .31493398
----------------------------------------------
Legend: * p<0.05; ** p<0.01; *** p<0.001
✅ Model with controls has correct specification
Section 5: Diagnostics - Heteroskedasticity¶
OLS assumes homoskedasticity: constant variance of errors. Let's test this assumption.
Under heteroskedasticity the coefficient estimates remain unbiased, but we face:
- Inefficient coefficient estimates (OLS is no longer the best linear unbiased estimator)
- Biased standard errors → incorrect t-statistics and p-values
- Invalid hypothesis tests
Breusch-Pagan test: Tests H₀: homoskedasticity vs H₁: heteroskedasticity
If p-value < 0.05, we reject H₀ and conclude heteroskedasticity is present.
Task 5.1: Generate Residuals and Fitted Values¶
Re-estimate model 2 from Task 4.1 (car_0_5 german log_at leverage roa) and generate:
- Residuals (call them resid)
- Fitted values (call them yhat)
Hints:
- predict varname, residuals: generate residuals
- predict varname: generate fitted values (default)
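One way to write this cell, assuming the dataset is in memory. The capture drop line is a convenience so the cell can be re-run, and the final summarize is an optional sanity check: with a constant in the model, OLS residuals average to zero.

```stata
* Allow re-running the cell without a "variable already defined" error
capture drop resid yhat

* Re-estimate model 2, then generate residuals and fitted values
quietly regress car_0_5 german log_at leverage roa
predict double resid, residuals
predict double yhat

* Optional check: the mean of resid should be (numerically) zero
summarize resid yhat
```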
(option xb assumed; fitted values)
✅ Residuals and fitted values generated
✅ Residuals and fitted values generated correctly
Task 5.2: Plot Residuals vs Fitted Values¶
Create a scatter plot of residuals (y-axis) against fitted values (x-axis). Add a horizontal line at y=0. Save it with graph export "$figures/residuals_vs_fitted.png", replace width(1200).
Interpretation: If the plot shows a "funnel" shape (residuals spreading out), that suggests heteroskedasticity.
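A possible implementation, assuming resid and yhat from Task 5.1 exist and $figures is defined. The titles are illustrative. Stata's built-in rvfplot, yline(0), run directly after regress, draws a similar plot without needing the generated variables.

```stata
* Residuals against fitted values with a reference line at zero
twoway scatter resid yhat, yline(0) ///
    title("Residuals vs fitted values") ///
    xtitle("Fitted values") ytitle("Residuals")

graph export "$figures/residuals_vs_fitted.png", replace width(1200)
```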
file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/03-assignment/output/figures/residuals_vs_fitted.png written in PNG format
✅ Residual plot saved: residuals_vs_fitted.png
✅ Residual plot created and saved correctly
Task 5.3: Breusch-Pagan Test¶
Re-estimate model 2 from Task 4.1, then perform the Breusch-Pagan test for heteroskedasticity.
Question: What is the p-value? Do we reject the null hypothesis of homoskedasticity?
Hint: estat hettest performs the Breusch-Pagan test after regress.
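A sketch of the test cell, assuming the dataset is in memory. The display line reads the stored results r(chi2) and r(p). The last command is an optional variant outside the task: the iid option replaces the normality assumption of the default test with an i.i.d. assumption, which can be a useful cross-check when normality of the errors is in doubt.

```stata
* Re-estimate model 2 and run the Breusch-Pagan test
quietly regress car_0_5 german log_at leverage roa
estat hettest
display "chi2(1) = " %6.2f r(chi2) ",  p = " %6.4f r(p)

* Optional variant that does not assume normally distributed errors
estat hettest, iid
```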
Breusch–Pagan/Cook–Weisberg test for heteroskedasticity
Assumption: Normal error terms
Variable: Fitted values of car_0_5
H0: Constant variance
chi2(1) = 23.19
Prob > chi2 = 0.0000
✅ Breusch-Pagan test completed
Test: Heteroskedasticity Diagnostics¶
✅ Heteroskedasticity diagnostics completed
Section 6: Diagnostics - Normality¶
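For exact small-sample t- and F-tests, the classical linear regression model assumes that errors are normally distributed (OLS coefficients remain unbiased without normality). Let's test this assumption.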
Task 6.1: Histogram of Residuals¶
Create a histogram of the residuals with a normal distribution overlay. Save it with graph export "$figures/histogram_residuals.png", replace width(1200).
Interpretation: If the histogram closely matches the normal curve, the normality assumption is reasonable.
Hint: histogram varname, normal creates a histogram with a normal density overlay.
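The histogram could be drawn as follows, assuming resid from Task 5.1 exists and $figures is defined. The title and axis label are illustrative.

```stata
* Histogram of residuals with a normal density overlay
histogram resid, normal ///
    title("Distribution of residuals") xtitle("Residuals")

graph export "$figures/histogram_residuals.png", replace width(1200)
```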
(bin=9, start=-.21515974, width=.05122012)
file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/03-assignment/output/figures/histogram_residuals.png written in PNG format
✅ Residual histogram saved: histogram_residuals.png
✅ Residual histogram saved correctly
Task 6.2: Shapiro-Wilk Test¶
Perform the Shapiro-Wilk test for normality on the residuals.
Question: What is the p-value? Do we reject the null hypothesis of normality?
Hint: swilk resid performs the Shapiro-Wilk test for normality.
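A minimal version of this cell, assuming resid from Task 5.1 exists. The display line is optional and simply reads the stored results r(W) and r(p).

```stata
* Shapiro-Wilk test: H0 is that resid is normally distributed
swilk resid
display "W = " %7.5f r(W) ",  p = " %7.5f r(p)
```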
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
resid | 84 0.92648 5.253 3.644 0.00013
✅ Shapiro-Wilk test completed correctly
Section 7: Robust Standard Errors¶
If heteroskedasticity is present, we should use robust standard errors (Huber-White standard errors) to get valid inference.
When to use:
- Breusch-Pagan test rejects homoskedasticity (p < 0.05)
- Visual inspection of residual plots shows non-constant variance
- As a precaution in cross-sectional data (common practice)
Effect: Coefficients stay the same, but standard errors (and thus t-stats and p-values) change to be valid under heteroskedasticity.
Task 7.1: Re-estimate with Robust Standard Errors¶
Re-estimate model 2 (regress car_0_5 german log_at leverage roa) using the robust option. Store as m3.
Question: Did the standard errors change? Did the coefficients change?
Hint: Add the robust option to the regress command: regress y x, robust
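A sketch of the robust re-estimation, assuming the dataset is in memory. The scalar se_ols is a hypothetical helper that keeps the conventional standard error on german, so the two can be printed side by side. The option vce(robust) is an equivalent spelling of robust.

```stata
* Conventional standard error on german, kept for comparison
quietly regress car_0_5 german log_at leverage roa
scalar se_ols = _se[german]

* Same model with heteroskedasticity-robust (Huber-White) standard errors
regress car_0_5 german log_at leverage roa, robust
estimates store m3

display "SE(german): conventional = " %6.4f se_ols ", robust = " %6.4f _se[german]
```

The coefficients are identical in both runs. Only the standard errors, and with them the t-statistics, p-values, and confidence intervals, change.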
Linear regression Number of obs = 84
F(4, 79) = 4.61
Prob > F = 0.0021
R-squared = 0.3149
Root MSE = .06168
------------------------------------------------------------------------------
| Robust
car_0_5 | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
german | -.13205 .0480099 -2.75 0.007 -.2276113 -.0364886
log_at | -.0057776 .0016403 -3.52 0.001 -.0090425 -.0025126
leverage | .0303046 .0388468 0.78 0.438 -.047018 .1076272
roa | .0413695 .0601818 0.69 0.494 -.0784194 .1611583
_cons | .0543134 .0182133 2.98 0.004 .0180608 .0905661
------------------------------------------------------------------------------
✅ Model with robust SE (m3) estimated and stored
✅ Robust SE model (m3) estimated correctly
Task 7.2: Compare All Three Models¶
Display all three models (m1, m2, m3) side-by-side using estimates table m1,..., star stats(N r2).
--------------------------------------------------------------
Variable | m1 m2 m3
-------------+------------------------------------------------
german | -.12954665*** -.13204996*** -.13204996**
log_at | -.00577757** -.00577757***
leverage | .03030462 .03030462
roa | .04136945 .04136945
_cons | .00333977 .05431344* .05431344**
-------------+------------------------------------------------
N | 84 84 84
r2 | .24546842 .31493398 .31493398
--------------------------------------------------------------
Legend: * p<0.05; ** p<0.01; *** p<0.001
Section 8: Export Publication-Quality Regression Table¶
Now let's create a professional regression table using esttab.
Task 8.1: Create LaTeX Regression Table¶
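Use esttab to export all three models to a LaTeX file called regression_table.tex in the tables folder. Start from esttab m1 m2 m3 using "$tables/regression_table.tex", and add the options needed to meet the requirements below.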
Requirements:
- Include standard errors below coefficients
- Add significance stars (*, **, ***)
- Include R² and number of observations
- Add appropriate variable labels
esttab options:
- se: include standard errors
- star(* 0.10 ** 0.05 *** 0.01): significance stars
- stats(N r2): add N and R²
- label: use variable labels instead of names
- replace: overwrite existing file
Hint: Use label variable varname "Label" before running esttab to give your variables readable names in the table (e.g., label variable log_at "Log(Total Assets)").
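Putting the listed options together, the export could look like the sketch below. It assumes m1, m2, and m3 are stored, $tables is defined, and the community-contributed estout package (which provides esttab) is installed. Apart from "Log(Total Assets)", the label texts are illustrative choices.

```stata
* esttab comes with the estout package; if missing, run once: ssc install estout

* Readable variable labels (only the log_at label is taken from the hint above)
label variable german   "German automaker"
label variable log_at   "Log(Total Assets)"
label variable leverage "Leverage"
label variable roa      "Return on assets"

* Export all three models to LaTeX
esttab m1 m2 m3 using "$tables/regression_table.tex", ///
    se star(* 0.10 ** 0.05 *** 0.01) stats(N r2) label replace
```

Note that these star thresholds (10%, 5%, 1%) differ from the estimates table legend used earlier (5%, 1%, 0.1%), so the number of stars on a coefficient is not directly comparable across the two tables.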
(output written to /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/03-assignment/output/tables/regression_table.tex)
✅ Regression table exported: regression_table.tex
✅ Regression table exported correctly
Section 9: Interpretation and Conclusions (optional/solution included)¶
Finally, let's interpret our findings and discuss limitations.
Task 9.1: Interpret Main Findings¶
In the markdown cell below, write 3-4 sentences interpreting your findings. Address:
- What is the magnitude of the German effect? (e.g., "German firms experienced X percentage points more negative returns...")
- Is this effect statistically significant?
- Did adding controls change the coefficient? What does this tell us?
- Based on the R², how much of the variation in CARs is explained by our model?
INTERPRETATION
Based on the regression results, German automakers experienced cumulative abnormal returns about 13 percentage points lower than those of other automotive firms during the Dieselgate event window (days 0-5). This effect is statistically significant at conventional levels (p < 0.01), suggesting the market viewed the scandal as primarily affecting German manufacturers rather than the global automotive industry.
When we add control variables for firm size, leverage, and profitability, the coefficient on German firm status remains large and significant, indicating the effect is not simply due to German firms being different along these observable dimensions. However, the relatively modest R² (around 0.25-0.31) suggests that firm nationality and fundamentals explain only a portion of the cross-sectional variation in stock returns—idiosyncratic firm characteristics and investor sentiment likely play important roles.
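The robust standard error on german is about twice the conventional one (0.048 vs. 0.024), while the robust standard errors on log_at and roa are somewhat smaller than their conventional counterparts. Differences of this size are consistent with the heteroskedasticity detected by the Breusch-Pagan test. The German coefficient remains significant at the 1% level (p = 0.007), and robust standard errors provide more reliable inference in this setting.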
Task 9.2: Discuss Limitations¶
In the markdown cell below, list 2-3 limitations of this analysis. Consider:
- Sample size
- Omitted variables
- Causality vs correlation
- Alternative explanations
LIMITATIONS:
Small sample size: With only 84 firms (7 German), our statistical power is limited, and results may be sensitive to outliers or individual firm characteristics.
Omitted variable bias: We control for basic firm fundamentals (size, leverage, profitability), but other important factors are missing—such as diesel exposure (% of sales from diesel vehicles), geographic revenue composition, or brand reputation—which could confound the German effect.
Correlation vs causation: While we find a strong association between German firm status and negative CARs, we cannot definitively claim causation. The effect could be driven by factors correlated with being German (e.g., regulatory exposure, media coverage) rather than German nationality per se. A more sophisticated identification strategy (e.g., difference-in-differences with non-automotive German firms as a control group) would strengthen causal claims.
References¶
- Fama-French Factors: Kenneth French Data Library - https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
- Event Study Methodology: MacKinlay, A. C. (1997). "Event Studies in Economics and Finance," Journal of Economic Literature, 35(1), 13-39
- Dieselgate: EPA Notice of Violation (September 18, 2015) - https://www.epa.gov/vw
- Stata Documentation: https://www.stata.com/features/documentation/
- Robust Standard Errors: White, H. (1980). "A Heteroskedasticity-Consistent Covariance Matrix Estimator," Econometrica, 48(4), 817-838
BM17FI · Academic Year 2025–26
Created by: Caspar David Peter
© 2026 Rotterdam School of Management