Data Analytics for Finance

BM17FI · Rotterdam School of Management

ASSIGNMENT 03

Regression Analysis

Setup

Run these cells to prepare your environment.

✅ The environment is cleared and ready.
/Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignment
> s/03-assignment
📁 Base directory: /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/03-assignment
📁 Raw data folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-privat
> e/private/assignments/03-assignment/data/raw
📁 Output directory: /Users/casparm4/Github/rsm-data-analytics-in-finance-priva
> te/private/assignments/03-assignment/output
📁 Tables folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/
> private/assignments/03-assignment/output/tables
📁 Figures folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/03-assignment/output/figures

Learning Objectives

By completing this assignment, you will:

  1. Estimate OLS regressions using the regress command
  2. Interpret coefficients economically and statistically
  3. Test OLS assumptions (heteroskedasticity, normality of residuals)
  4. Apply robust standard errors when heteroskedasticity is detected
  5. Create publication-quality regression tables using esttab

Research Question

"What firm characteristics explain the cross-sectional variation in stock price reactions to the Dieselgate scandal?"

Specifically, we want to know:

  • Did German automakers experience more negative abnormal returns than other automotive firms?
  • Can this effect be explained by firm characteristics such as size, leverage, and profitability?

Background: Event Study and Abnormal Returns

ℹ️ Background
What are Cumulative Abnormal Returns (CARs)?

An abnormal return (AR) is the difference between a stock's actual return and its expected return based on market movements:

$AR_t = R_t - E[R_t]$

Where:
  • $R_t$ = Actual return on day $t$
  • $E[R_t]$ = Expected return from market model (α + β × Market Return)

The Cumulative Abnormal Return (CAR) sums abnormal returns over an event window:

$CAR[0,5] = AR_0 + AR_1 + AR_2 + AR_3 + AR_4 + AR_5$

Interpretation: A CAR of -0.15 means the stock underperformed its expected return by 15 percentage points during the event window.
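
As a rough illustration of these definitions, the sketch below computes abnormal returns and a six-day CAR from a daily panel. It assumes a hypothetical dataset with one row per firm and event day, and the variable names ret, mktret, alpha, beta, and event_day are invented for the example. It uses the simple one-factor market model from the formula above, not the pre-calculated CARs supplied with the assignment.

```stata
* Sketch only: ret, mktret, alpha, beta, and event_day are hypothetical
* variable names, not part of auto_firms_event_crosssection.dta.

* Abnormal return: actual return minus the market-model expected return
generate ar = ret - (alpha + beta * mktret)

* CAR[0,5]: sum of abnormal returns over event days 0 through 5, per firm
bysort gvkey: egen car_check = total(cond(inrange(event_day, 0, 5), ar, 0))
```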

In this assignment, CARs have been pre-calculated using the market model with Fama-French Europe factors. See the codebook for methodology details. We will come back to this in more detail in assignment 5.
📝 Tasks
In this assignment, you will complete the following graded tasks:
  1. Load and examine the cross-sectional dataset
  2. Perform exploratory analysis (correlations and scatter plots)
  3. Estimate a baseline OLS regression
  4. Add control variables and compare models
  5. Test for heteroskedasticity (Breusch-Pagan test)
  6. Test for normality of residuals (Shapiro-Wilk test)
  7. Re-estimate with robust standard errors
  8. Export a publication-quality regression table
  9. Interpret your findings

Section 1: Load and Examine Data

In this section, you'll load the cross-sectional dataset and examine its structure.

Task 1.1: Load the Dataset

Load the file auto_firms_event_crosssection.dta from the raw data folder.

Stata Tip
Use use "$raw/filename.dta", clear to load a Stata dataset.
✅ Dataset loaded successfully
✅ Dataset loaded correctly
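
A minimal sketch of the load step, assuming the global macro $raw was defined by the Setup cells and points to the raw data folder shown there:

```stata
* Assumes $raw was set in the Setup cells
use "$raw/auto_firms_event_crosssection.dta", clear
```

The clear option discards any data already in memory, so the cell can be re-run safely.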

Task 1.2: Examine the Dataset

Use describe to examine the structure of the dataset. How many observations? How many variables?

Contains data from /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/03-assignment/data/raw/auto_firms_event_crosssection.dta
 Observations:            84                  
    Variables:            24                  06 Jan 2026 15:08
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
gvkey           str6    %-9s                  
conm            str28   %-9s                  
fic             str3    %-9s                  
german          long    %12.0g                
car_0_5         double  %10.0g                
car_0_10        double  %10.0g                
log_at          double  %10.0g                
leverage        double  %10.0g                
roa             double  %10.0g                
debt_equity     double  %10.0g                
roe             double  %10.0g                
margin          double  %10.0g                
cash_ratio      double  %10.0g                
at              double  %10.0g                
sale            double  %10.0g                
dltt            double  %10.0g                
dlc             double  %10.0g                
ceq             double  %10.0g                
ebitda          double  %10.0g                
ib              double  %10.0g                
oibdp           double  %10.0g                
che             double  %10.0g                
n_days_0_5      long    %12.0g                
n_days_0_10     long    %12.0g                
-------------------------------------------------------------------------------
Sorted by: 
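
One way to answer the two questions directly, assuming the dataset from Task 1.1 is in memory, is to follow describe with the built-in counters _N (observations) and c(k) (variables):

```stata
* Structure of the dataset: names, storage types, formats, labels
describe

* Number of observations and variables in memory
display "Observations: " _N "   Variables: " c(k)
```

In the output above this corresponds to 84 observations and 24 variables. The variable label column is empty, which matters later when esttab is asked to print labels.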

Task 1.3: Summary Statistics

Generate summary statistics for the key variables: car_0_5, car_0_10, german, log_at, leverage, and roa.

Stata Tip
Use summarize varlist to show mean, SD, min, and max for specified variables.
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     car_0_5 |         84   -.0074558    .0727016  -.3523587   .2633667
    car_0_10 |         84    .0072848    .1414629  -.8872483   .2801222
      german |         84    .0833333    .2780454          0          1
      log_at |         84    10.36085    3.473371   2.858938   18.80747
    leverage |         84    .2161082    .1966058          0   .9964587
-------------+---------------------------------------------------------
         roa |         84    .0615514    .1036582  -.4132906   .3438476
✅ Summary statistics computed correctly
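
A short sketch of this step, assuming the dataset is loaded. The extra tabulate call is an addition for illustration: because german is a 0/1 indicator, its mean of 0.0833 over 84 firms implies 7 German firms, which the frequency table confirms.

```stata
* Mean, SD, min, and max for the key variables
summarize car_0_5 car_0_10 german log_at leverage roa

* Frequency count of the German indicator
tabulate german
```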

Section 2: Exploratory Analysis

Before running regressions, let's explore the relationships between variables using correlations and scatter plots.

Task 2.1: Correlation Matrix

Calculate the correlation matrix for car_0_5, german, log_at, leverage, and roa.
Question: Is car_0_5 negatively correlated with german? What does this suggest?

Stata Tip
Use correlate varlist to display a correlation matrix.
(obs=84)

             |  car_0_5   german   log_at leverage      roa
-------------+---------------------------------------------
     car_0_5 |   1.0000
      german |  -0.4954   1.0000
      log_at |  -0.2459  -0.0106   1.0000
    leverage |   0.0084   0.0700   0.0684   1.0000
         roa |  -0.0652   0.0153   0.3246  -0.3275   1.0000

✅ Correlation matrix computed correctly
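
A sketch of the correlation step, assuming the dataset is loaded. The second command is an optional addition: pwcorr with the sig option prints a p-value under each correlation, which helps judge whether the correlation of -0.4954 between car_0_5 and german is distinguishable from zero.

```stata
* Correlation matrix as requested in the task
correlate car_0_5 german log_at leverage roa

* Optional: pairwise correlations with significance levels
pwcorr car_0_5 german log_at leverage roa, sig
```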

Task 2.2: Scatter Plot - CAR vs German

Create a scatter plot of car_0_5 (y-axis) against german (x-axis). Add axis labels and a title, then save the figure with graph export "$figures/scatter_car_german.png", replace width(1200).

file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assig
> nments/03-assignment/output/figures/scatter_car_german.png written in PNG for
> mat
✅ Scatter plot saved: scatter_car_german.png
[Figure: scatter plot of car_0_5 against german]
✅ Scatter plot (CAR vs German) saved correctly
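
One possible version of this plot is sketched below. The title and axis texts are illustrative choices, $figures is assumed to be defined by the Setup cells, and the /// line continuation assumes the cell is executed like a do-file.

```stata
* Scatter of CAR[0,5] against the German indicator
twoway scatter car_0_5 german, ///
    title("CAR[0,5] by German indicator") ///
    xtitle("German firm") ytitle("CAR[0,5]") ///
    xlabel(0 "Other" 1 "German")

* Save the figure to the figures folder
graph export "$figures/scatter_car_german.png", replace width(1200)
```

Because german takes only the values 0 and 1, the points form two vertical strips, and the comparison of interest is the vertical position of each strip.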

Task 2.3: Scatter Plot - CAR vs Size

Create a scatter plot of car_0_5 (y-axis) against log_at (x-axis). Save the figure with graph export "$figures/scatter_car_size.png".

file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assig
> nments/03-assignment/output/figures/scatter_car_size.png written in PNG forma
> t
✅ Scatter plot saved: scatter_car_size.png
[Figure: scatter plot of car_0_5 against log_at]
✅ Scatter plot (CAR vs Size) saved correctly
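
A sketch for the size plot, under the same assumptions as before ($figures defined in Setup, cell run like a do-file). The fitted line from lfit is an optional addition that makes the slope easier to see, and the replace option is added so the cell can be re-run.

```stata
* Scatter of CAR[0,5] against firm size, with a linear fit overlaid
twoway (scatter car_0_5 log_at) (lfit car_0_5 log_at), ///
    title("CAR[0,5] and firm size") ///
    xtitle("Log of total assets") ytitle("CAR[0,5]") ///
    legend(off)

graph export "$figures/scatter_car_size.png", replace
```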

Checkpoint: Exploratory Analysis Complete

✅ Exploratory analysis plots created successfully

Section 3: Baseline OLS Regression

Now let's estimate a simple regression to test whether German firms experienced more negative CARs.

Task 3.1: Estimate Baseline Model

Estimate the regression: car_0_5 = β₀ + β₁×german + ε

Store the estimates with name m1 for later comparison.

Questions to consider:

  • What is the coefficient on german? Is it statistically significant?
  • What does this coefficient mean economically?
  • What is the R-squared? What does it tell us?
Stata Tip
  • regress depvar indepvars — OLS regression
  • estimates store name — Save estimates for later use
      Source |       SS           df       MS      Number of obs   =        84
-------------+----------------------------------   F(1, 82)        =     26.68
       Model |  .107686651         1  .107686651   Prob > F        =    0.0000
    Residual |  .331011944        82  .004036731   R-squared       =    0.2455
-------------+----------------------------------   Adj R-squared   =    0.2363
       Total |  .438698595        83  .005285525   Root MSE        =    .06354

------------------------------------------------------------------------------
     car_0_5 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      german |  -.1295467   .0250819    -5.16   0.000    -.1794425   -.0796508
       _cons |   .0033398   .0072405     0.46   0.646    -.0110639    .0177435
------------------------------------------------------------------------------
✅ Baseline model (m1) estimated and stored
✅ Baseline model (m1) estimated correctly
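
A sketch of the baseline step, assuming the dataset is loaded. The ttest at the end is an addition meant to help with the economic interpretation: with a single 0/1 regressor, the OLS slope equals the difference in mean CAR between German and other firms, and the constant equals the mean CAR of the other firms.

```stata
* Baseline model: CAR[0,5] on the German indicator
regress car_0_5 german
estimates store m1

* Cross-check: ttest reports mean(group 0) - mean(group 1), so the
* difference has the opposite sign of the regression coefficient
ttest car_0_5, by(german)
```

In the output above, the coefficient of -0.1295 means German firms' CARs were on average about 13 percentage points lower than those of other firms.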

Section 4: Add Control Variables

The baseline model might suffer from omitted variable bias. Let's add control variables for firm characteristics.

ℹ️ Background: Why Add Controls?
Omitted variable bias occurs when we leave out relevant variables that are correlated with both the treatment (German) and the outcome (CAR).

For example:
  • Size (log_at): Larger firms might have more diversified operations and be less affected by single-country scandals
  • Leverage: Highly leveraged firms might be more vulnerable to negative shocks
  • Profitability (roa): More profitable firms might weather bad news better

By including these controls, we can isolate the effect of being German, holding constant these other firm characteristics.

Task 4.1: Regression with Controls

Estimate the regression: car_0_5 = β₀ + β₁×german + β₂×log_at + β₃×leverage + β₄×roa + ε

Store the estimates with name m2.

Questions to consider:

  • Did the coefficient on german change compared to the baseline model?
  • What does this tell us about omitted variable bias?
  • Did the R-squared improve? By how much?
      Source |       SS           df       MS      Number of obs   =        84
-------------+----------------------------------   F(4, 79)        =      9.08
       Model |  .138161095         4  .034540274   Prob > F        =    0.0000
    Residual |    .3005375        79  .003804272   R-squared       =    0.3149
-------------+----------------------------------   Adj R-squared   =    0.2802
       Total |  .438698595        83  .005285525   Root MSE        =    .06168

------------------------------------------------------------------------------
     car_0_5 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      german |    -.13205   .0244421    -5.40   0.000    -.1807007   -.0833992
      log_at |  -.0057776   .0021024    -2.75   0.007    -.0099622   -.0015929
    leverage |   .0303046   .0372949     0.81   0.419    -.0439291    .1045383
         roa |   .0413695   .0744365     0.56   0.580    -.1067927    .1895316
       _cons |   .0543134   .0222178     2.44   0.017     .0100899    .0985369
------------------------------------------------------------------------------
✅ Model with controls (m2) estimated and stored
✅ Model with controls (m2) estimated correctly
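
A sketch of the model with controls, assuming the dataset is loaded and m1 has been stored. The final display line is an illustrative addition that pulls R-squared from the stored results.

```stata
* Model with firm-level controls
regress car_0_5 german log_at leverage roa
estimates store m2

* R-squared of the current model, from the stored e() results
display "R-squared (m2): " %6.4f e(r2)
```

In the output above, the german coefficient moves only from -0.1295 to -0.1321, while R-squared rises from 0.2455 to 0.3149.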

Task 4.2: Compare Models

Use estimates table to display both models side-by-side for easy comparison.

Stata Tip
Use estimates table m1 m2, star stats(N r2) to compare stored models with stars for significance and fit statistics.
----------------------------------------------
    Variable |      m1              m2        
-------------+--------------------------------
      german | -.12954665***   -.13204996***  
      log_at |                 -.00577757**   
    leverage |                  .03030462     
         roa |                  .04136945     
       _cons |  .00333977       .05431344*    
-------------+--------------------------------
           N |         84              84     
          r2 |  .24546842       .31493398     
----------------------------------------------
      Legend: * p<0.05; ** p<0.01; *** p<0.001
✅ Model with controls has correct specification
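
A sketch of the comparison, assuming m1 and m2 were stored in the earlier tasks. The b(%9.4f) format is an optional addition that shortens the coefficients to four decimals.

```stata
* Side-by-side comparison of the stored models
estimates table m1 m2, star stats(N r2) b(%9.4f)
```

Note that estimates table uses the thresholds shown in its legend (0.05, 0.01, 0.001), which differ from the 0.10, 0.05, 0.01 convention common in finance papers.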

Section 5: Diagnostics - Heteroskedasticity

OLS assumes homoskedasticity: constant variance of errors. Let's test this assumption.

ℹ️ Background: Heteroskedasticity
Heteroskedasticity means the variance of errors is not constant across observations. This violates the classical OLS assumptions. The coefficient estimates remain unbiased, but heteroskedasticity leads to:
  • Inefficient coefficient estimates (not the best linear unbiased estimators)
  • Biased standard errors → incorrect t-statistics and p-values
  • Invalid hypothesis tests

Breusch-Pagan test: Tests H₀: homoskedasticity vs H₁: heteroskedasticity
If p-value < 0.05, we reject H₀ and conclude heteroskedasticity is present.

Task 5.1: Generate Residuals and Fitted Values

Re-estimate model 2 from Task 4.1 (car_0_5 german log_at leverage roa) and generate:

  • Residuals (call them resid)
  • Fitted values (call them yhat)
Stata Tip
After a regression, use:
  • predict varname, residuals — Generate residuals
  • predict varname — Generate fitted values (default)
(option xb assumed; fitted values)
✅ Residuals and fitted values generated
✅ Residuals and fitted values generated correctly
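
A sketch of this step, assuming the dataset is loaded. The capture drop lines are an addition so the cell can be re-run without a "variable already defined" error.

```stata
* Re-estimate model 2 without printing the table again
quietly regress car_0_5 german log_at leverage roa

* Remove earlier versions of the variables if they exist
capture drop resid
capture drop yhat

* Residuals and fitted values from the regression just estimated
predict resid, residuals
predict yhat, xb
```

Writing xb explicitly gives the same fitted values as the default and avoids the "option xb assumed" note.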

Task 5.2: Plot Residuals vs Fitted Values

Create a scatter plot of residuals (y-axis) against fitted values (x-axis). Add a horizontal line at y=0, then save the figure with graph export "$figures/residuals_vs_fitted.png", replace width(1200).

Interpretation: If the plot shows a "funnel" shape (residuals spreading out), that suggests heteroskedasticity.

file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assig
> nments/03-assignment/output/figures/residuals_vs_fitted.png written in PNG fo
> rmat
✅ Residual plot saved: residuals_vs_fitted.png
[Figure: residuals against fitted values]
✅ Residual plot created and saved correctly
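
A sketch of the residual plot, assuming resid and yhat exist from Task 5.1 and $figures is defined in Setup. Titles are illustrative.

```stata
* Residuals against fitted values with a reference line at zero
twoway scatter resid yhat, yline(0) ///
    title("Residuals vs fitted values") ///
    xtitle("Fitted values") ytitle("Residuals")

graph export "$figures/residuals_vs_fitted.png", replace width(1200)
```

Directly after a regress command, rvfplot, yline(0) produces a similar plot without generating the variables first.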

Task 5.3: Breusch-Pagan Test

Re-estimate model 2 from Task 4.1, then perform the Breusch-Pagan test for heteroskedasticity.

Question: What is the p-value? Do we reject the null hypothesis of homoskedasticity?

Stata Tip
After a regression, use estat hettest to perform the Breusch-Pagan test.
Breusch–Pagan/Cook–Weisberg test for heteroskedasticity 
Assumption: Normal error terms
Variable: Fitted values of car_0_5

H0: Constant variance

    chi2(1) =  23.19
Prob > chi2 = 0.0000
✅ Breusch-Pagan test completed
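
A sketch of the test, assuming the dataset is loaded. The printed p-value is rounded to 0.0000, so the last two lines are an addition showing how to see it at higher precision, either from the stored result r(p) or by evaluating the chi-squared tail probability for the reported statistic.

```stata
* Re-estimate model 2, then run the Breusch-Pagan test
quietly regress car_0_5 german log_at leverage roa
estat hettest

* p-value at higher precision, from the stored test results
display "p-value from r(p): " r(p)

* Same idea by hand: upper tail of chi2(1) at the reported statistic
display "p-value by hand:   " chi2tail(1, 23.19)
```

With a p-value this far below 0.05, the null hypothesis of constant variance is rejected.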

Test: Heteroskedasticity Diagnostics

✅ Heteroskedasticity diagnostics completed

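Section 6: Diagnostics - Normality

The classical linear model assumes that errors are normally distributed. OLS does not need normality to be unbiased, but exact small-sample t- and F-tests rely on it. Let's test this assumption.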

Task 6.1: Histogram of Residuals

Create a histogram of the residuals with a normal distribution overlay, then save the figure with graph export "$figures/histogram_residuals.png", replace width(1200).

Interpretation: If the histogram closely matches the normal curve, the normality assumption is reasonable.

Stata Tip
Use histogram varname, normal to create a histogram with normal density overlay.
(bin=9, start=-.21515974, width=.05122012)
file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assig
> nments/03-assignment/output/figures/histogram_residuals.png written in PNG fo
> rmat
✅ Residual histogram saved: histogram_residuals.png
[Figure: histogram of residuals with normal density overlay]
✅ Residual histogram saved correctly
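
A sketch of the histogram step, assuming resid exists from Task 5.1 and $figures is defined in Setup. Titles are illustrative.

```stata
* Histogram of residuals with a normal density overlaid
histogram resid, normal ///
    title("Distribution of residuals") xtitle("Residuals")

graph export "$figures/histogram_residuals.png", replace width(1200)
```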

Task 6.2: Shapiro-Wilk Test

Perform the Shapiro-Wilk test for normality on the residuals.

Question: What is the p-value? Do we reject the null hypothesis of normality?

Stata Tip
Use swilk resid to perform the Shapiro-Wilk test for normality.
                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
       resid |         84    0.92648      5.253     3.644    0.00013
✅ Shapiro-Wilk test completed correctly
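
A sketch of the normality test, assuming resid exists from Task 5.1. The qnorm plot is an optional addition: it plots residual quantiles against normal quantiles, so departures in the tails show up as points away from the diagonal.

```stata
* Shapiro-Wilk test on the residuals
swilk resid
display "Shapiro-Wilk p-value: " r(p)

* Optional visual check: normal quantile plot
qnorm resid
```

In the output above the p-value is 0.00013, so normality of the residuals is rejected at the 5% level.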

Section 7: Robust Standard Errors

If heteroskedasticity is present, we should use robust standard errors (Huber-White standard errors) to get valid inference.

ℹ️ Background: Robust Standard Errors
Robust (heteroskedasticity-consistent) standard errors adjust for heteroskedasticity without changing coefficient estimates.

When to use:
  • Breusch-Pagan test rejects homoskedasticity (p < 0.05)
  • Visual inspection of residual plots shows non-constant variance
  • As a precaution in cross-sectional data (common practice)

Effect: Coefficients stay the same, but standard errors (and thus t-stats and p-values) change to be valid under heteroskedasticity.
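
For reference, the heteroskedasticity-consistent ("sandwich") variance estimator behind robust standard errors can be written as follows. The notation is an addition for illustration: $x_i$ is the row vector of regressors for firm $i$ written as a column, $\hat{\varepsilon}_i$ is the OLS residual, $n$ is the number of observations, and $k$ is the number of estimated coefficients. The factor $n/(n-k)$ is the finite-sample adjustment that, to my understanding, Stata applies by default with the robust option.

```latex
\widehat{\mathrm{Var}}(\hat{\beta})
  = \frac{n}{n-k}\,(X'X)^{-1}
    \left( \sum_{i=1}^{n} \hat{\varepsilon}_i^{2}\, x_i x_i' \right)
    (X'X)^{-1}
```

The coefficient vector $\hat{\beta} = (X'X)^{-1}X'y$ does not appear on the right-hand side in a way that changes it, which is why only the standard errors move.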

Task 7.1: Re-estimate with Robust Standard Errors

Re-estimate model 2 (regress car_0_5 german log_at leverage roa) using the robust option. Store as m3.

Question: Did the standard errors change? Did the coefficients change?

Stata Tip
Add the robust option to the regress command: regress y x, robust
Linear regression                               Number of obs     =         84
                                                F(4, 79)          =       4.61
                                                Prob > F          =     0.0021
                                                R-squared         =     0.3149
                                                Root MSE          =     .06168

------------------------------------------------------------------------------
             |               Robust
     car_0_5 | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      german |    -.13205   .0480099    -2.75   0.007    -.2276113   -.0364886
      log_at |  -.0057776   .0016403    -3.52   0.001    -.0090425   -.0025126
    leverage |   .0303046   .0388468     0.78   0.438     -.047018    .1076272
         roa |   .0413695   .0601818     0.69   0.494    -.0784194    .1611583
       _cons |   .0543134   .0182133     2.98   0.004     .0180608    .0905661
------------------------------------------------------------------------------
✅ Model with robust SE (m3) estimated and stored
✅ Robust SE model (m3) estimated correctly
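
A sketch of the robust re-estimation, assuming the dataset is loaded. The option can also be written as vce(robust), which is equivalent.

```stata
* Model 2 with heteroskedasticity-robust standard errors
regress car_0_5 german log_at leverage roa, robust
estimates store m3
```

In the output above the coefficients are identical to m2, while the standard error on german rises from 0.0244 to 0.0480 and the one on log_at falls from 0.0021 to 0.0016.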

Task 7.2: Compare All Three Models

Display all three models (m1, m2, m3) side-by-side using estimates table m1 m2 m3, star stats(N r2).

--------------------------------------------------------------
    Variable |      m1              m2              m3        
-------------+------------------------------------------------
      german | -.12954665***   -.13204996***   -.13204996**   
      log_at |                 -.00577757**    -.00577757***  
    leverage |                  .03030462       .03030462     
         roa |                  .04136945       .04136945     
       _cons |  .00333977       .05431344*      .05431344**   
-------------+------------------------------------------------
           N |         84              84              84     
          r2 |  .24546842       .31493398       .31493398     
--------------------------------------------------------------
                      Legend: * p<0.05; ** p<0.01; *** p<0.001
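
A sketch of the three-model comparison, assuming m1, m2, and m3 are stored. The second command is an addition: since star cannot be combined with se, a separate table with standard errors makes the difference between m2 and m3 visible.

```stata
* Coefficients with significance stars
estimates table m1 m2 m3, star stats(N r2)

* Coefficients with standard errors underneath (no stars)
estimates table m1 m2 m3, b(%9.4f) se(%9.4f) stats(N r2)
```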

Section 8: Export Publication-Quality Regression Table

Now let's create a professional regression table using esttab.

Task 8.1: Create LaTeX Regression Table

Use esttab to export all three models to a LaTeX file called regression_table.tex in the tables folder. Start from esttab m1 m2 m3 using "$tables/regression_table.tex", and add your options after the comma.

Requirements:

  • Include standard errors below coefficients
  • Add significance stars (*, **, ***)
  • Include R² and number of observations
  • Add appropriate variable labels
Stata Tip: esttab Options
Key esttab options:
  • se — Include standard errors
  • star(* 0.10 ** 0.05 *** 0.01) — Significance stars
  • stats(N r2) — Add N and R²
  • label — Use variable labels instead of names
  • replace — Overwrite existing file
Tip: Use label variable varname "Label" before running esttab to give your variables readable names in the table (e.g., label variable log_at "Log(Total Assets)").
(output written to /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/03-assignment/output/tables/regression_table.tex)
✅ Regression table exported: regression_table.tex
✅ Regression table exported correctly
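
One possible version of the export is sketched below. It assumes m1, m2, and m3 are stored, $tables is defined in Setup, and the user-written estout package (which provides esttab) is installed. The label texts and model titles are illustrative choices, not prescribed by the assignment.

```stata
* esttab comes with the estout package; install once with:
*   ssc install estout

* Readable labels for the table (wording is illustrative)
label variable car_0_5  "CAR[0,5]"
label variable german   "German firm"
label variable log_at   "Log(Total Assets)"
label variable leverage "Leverage"
label variable roa      "Return on Assets"

* Export the three stored models to LaTeX
esttab m1 m2 m3 using "$tables/regression_table.tex", ///
    se star(* 0.10 ** 0.05 *** 0.01) ///
    stats(N r2, labels("Observations" "R-squared")) ///
    mtitles("Baseline" "Controls" "Robust SE") ///
    label replace
```

Setting the labels before calling esttab matters here because the dataset ships without variable labels, so the label option would otherwise fall back to the variable names.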

Section 9: Interpretation and Conclusions (optional/solution included)

Finally, let's interpret our findings and discuss limitations.

Task 9.1: Interpret Main Findings

In the markdown cell below, write 3-4 sentences interpreting your findings. Address:

  • What is the magnitude of the German effect? (e.g., "German firms experienced X percentage points more negative returns...")
  • Is this effect statistically significant?
  • Did adding controls change the coefficient? What does this tell us?
  • Based on the R², how much of the variation in CARs is explained by our model?

INTERPRETATION

Based on the regression results, German automakers experienced cumulative abnormal returns about 13 percentage points more negative than those of other automotive firms during the Dieselgate event window (days 0-5). This effect is statistically significant at conventional levels (p < 0.01), suggesting the market viewed the scandal as primarily affecting German manufacturers rather than the global automotive industry.

When we add control variables for firm size, leverage, and profitability, the coefficient on German firm status remains large and significant, indicating the effect is not simply due to German firms being different along these observable dimensions. However, the relatively modest R² (around 0.25-0.31) suggests that firm nationality and fundamentals explain only a portion of the cross-sectional variation in stock returns—idiosyncratic firm characteristics and investor sentiment likely play important roles.

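The robust standard error on german is roughly twice the conventional one (0.048 vs. 0.024), while the robust standard errors on log_at and roa are somewhat smaller. This sensitivity is consistent with the heteroskedasticity detected by the Breusch-Pagan test. The German coefficient remains significant at the 1% level (p = 0.007), and robust standard errors provide more reliable inference in this setting.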
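
To state the magnitude in the right units, it can help to convert the coefficient explicitly. The sketch below assumes m3 is stored. Since CARs are measured in decimal return units, multiplying by 100 gives percentage points and multiplying by 10,000 gives basis points.

```stata
* Restore the robust model and express the German effect in other units
estimates restore m3
display "German effect (percentage points): " %8.2f 100 * _b[german]
display "German effect (basis points):      " %8.0f 10000 * _b[german]
display "Robust SE on german:               " %8.4f _se[german]
```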

Task 9.2: Discuss Limitations

In the markdown cell below, list 2-3 limitations of this analysis. Consider:

  • Sample size
  • Omitted variables
  • Causality vs correlation
  • Alternative explanations

LIMITATIONS:

  1. Small sample size: With only 84 firms (7 German), our statistical power is limited, and results may be sensitive to outliers or individual firm characteristics.

  2. Omitted variable bias: We control for basic firm fundamentals (size, leverage, profitability), but other important factors are missing—such as diesel exposure (% of sales from diesel vehicles), geographic revenue composition, or brand reputation—which could confound the German effect.

  3. Correlation vs causation: While we find a strong association between German firm status and negative CARs, we cannot definitively claim causation. The effect could be driven by factors correlated with being German (e.g., regulatory exposure, media coverage) rather than German nationality per se. A more sophisticated identification strategy (e.g., difference-in-differences with non-automotive German firms as a control group) would strengthen causal claims.


References

  • Fama-French Factors: Kenneth French Data Library - https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
  • Event Study Methodology: MacKinlay, A. C. (1997). "Event Studies in Economics and Finance," Journal of Economic Literature, 35(1), 13-39
  • Dieselgate: EPA Notice of Violation (September 18, 2015) - https://www.epa.gov/vw
  • Stata Documentation: https://www.stata.com/features/documentation/
  • Robust Standard Errors: White, H. (1980). "A Heteroskedasticity-Consistent Covariance Matrix Estimator," Econometrica, 48(4), 817-838

BM17FI · Academic Year 2025–26

Erasmus University Rotterdam

Created by: Caspar David Peter

© 2026 Rotterdam School of Management