Data Analytics for Finance

BM17FI · Rotterdam School of Management


ASSIGNMENT 01

Data Wrangling

Learning Objectives

In this assignment, you will learn to:

  1. Import data from CSV files into Stata
  2. Examine data structure using describe, list, summarize
  3. Clean data: handle missing values, drop irrelevant observations
  4. Merge datasets using merge command with different identifier types
  5. Declare panel data structure with xtset for time-series operations
  6. Save processed data in Stata format (.dta)

Context: The Volkswagen Dieselgate Scandal

On September 18, 2015, the U.S. Environmental Protection Agency (EPA) issued a Notice of Violation to Volkswagen AG for installing "defeat devices" in diesel vehicles to cheat emissions tests. This scandal affected millions of vehicles worldwide and had significant impacts on Volkswagen and the broader automotive industry.

Throughout this assignment series, you will analyze the market and reputational consequences of this event using financial data from German automakers.


Exercises

📝 Assignment Tasks
  1. Load and examine company identifier data
  2. Load and examine daily stock price data
  3. Clean the data by removing observations with missing prices
  4. Merge the two datasets using company identifiers
  5. Declare panel data structure with xtset
  6. Create new variables: returns, year, and month
  7. Generate summary statistics by firm
  8. Save the cleaned dataset in Stata format

Setup

Clear Environment

We start by clearing Stata's memory and disabling pagination.

✅ The environment is cleared and ready.
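The clearing commands themselves aren't shown in the log above; a minimal sketch of the standard pair:

```stata
clear all       // remove data, matrices, and programs from memory
set more off    // disable pagination so long output scrolls without pausing
```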

Set File Paths

We define global macros for all data and output directories. This makes our code portable and easier to maintain.

/Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/01-assignment
📁 Base directory: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/01-assignment
📁 Raw data folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/01-assignment/data/raw
📁 Processed data folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/01-assignment/data/processed
📁 Output directory: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/01-assignment/output
📁 Tables folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/01-assignment/output/tables
📁 Figures folder: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/01-assignment/output/figures
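The directory macros reported above can be defined along these lines (a sketch; the base path is machine-specific, so substitute your own):

```stata
global base      "~/path/to/01-assignment"   // adjust to your machine
global raw       "$base/data/raw"
global processed "$base/data/processed"
global output    "$base/output"
global tables    "$output/tables"
global figures   "$output/figures"
```

Because the subfolder macros are built from $base, moving the project only requires changing one line.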

Section 1: Load and Examine Header Data

Task 1.1: Import Company Identifiers

What you'll do: Load the auto_firms_g_names.csv file into Stata's memory.

Why this matters: This file contains company identifiers and basic information (company names, industry codes, securities identifiers) for German automakers. These identifiers will allow us to link firms across different datasets.

What to expect: You'll see 7 German automotive companies with their identifying information. Each row represents one firm.

Stata Tip
Use import delimited "$raw/filename.csv", clear to load CSV files into Stata. The clear option removes any data currently in memory.
(encoding automatically selected: ISO-8859-1)
(7 vars, 7 obs)

Task 1.2: Examine the Data Structure

What you'll do: Inspect the structure and contents of the header data you just loaded.

Why this matters: Before working with any dataset, you must understand its structure: What variables does it contain? What are their types? How many observations are there? This prevents errors and helps you plan your analysis.

Key variables to examine:

  • gvkey: Compustat Global Company Key (the unique firm identifier)
  • conm: Company name
  • sic: Standard Industrial Classification code
  • isin: International Securities Identification Number

What to expect: You should see 7 firms, all with SIC code 3711 (Motor Vehicles & Passenger Car Bodies), all incorporated in Germany (fic = "DEU").

Stata Tip
Consider using the following commands:
  • describe — shows variable names, types, and storage formats
  • list in 1/7 — displays the first 7 observations
  • count — shows the number of observations
Contains data
 Observations:             7                  
    Variables:             7                  
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
gvkey           long    %12.0g                
conm            str28   %28s                  
sic             int     %8.0g                 
naics           long    %12.0g                
fic             str3    %9s                   
isin            str12   %12s                  
sedol           str7    %9s                   
-------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

     +-------------------------------------------------------------+
  1. |  gvkey |                         conm |  sic |  naics | fic |
     |  17828 |       MERCEDES BENZ GROUP AG | 3711 | 336110 | DEU |
     |-------------------------------------------------------------|
     |                  isin          |            sedol           |
     |          DE0007100000          |          5529027           |
     +-------------------------------------------------------------+

     +-------------------------------------------------------------+
  2. |  gvkey |                         conm |  sic |  naics | fic |
     | 100022 | BAYERISCHE MOTOREN WERKE AKT | 3711 | 336111 | DEU |
     |-------------------------------------------------------------|
     |                  isin          |            sedol           |
     |          DE0005190003          |          5756029           |
     +-------------------------------------------------------------+

     +-------------------------------------------------------------+
  3. |  gvkey |                         conm |  sic |  naics | fic |
     | 100042 |                       MAN SE | 3711 | 336120 | DEU |
     |-------------------------------------------------------------|
     |                  isin          |            sedol           |
     |          DE0005937007          |          5563520           |
     +-------------------------------------------------------------+

     +-------------------------------------------------------------+
  4. |  gvkey |                         conm |  sic |  naics | fic |
     | 100737 |                VOLKSWAGEN AG | 3711 | 336110 | DEU |
     |-------------------------------------------------------------|
     |                  isin          |            sedol           |
     |          DE0007664039          |          5497168           |
     +-------------------------------------------------------------+

     +-------------------------------------------------------------+
  5. |  gvkey |                         conm |  sic |  naics | fic |
     | 101120 |                      AUDI AG | 3711 | 336111 | DEU |
     |-------------------------------------------------------------|
     |                  isin          |            sedol           |
     |          DE0006757008          |          5761498           |
     +-------------------------------------------------------------+

     +-------------------------------------------------------------+
  6. |  gvkey |                         conm |  sic |  naics | fic |
     | 102187 | PORSCHE AUTOMOBIL HOLDING SE | 3711 | 336111 | DEU |
     |-------------------------------------------------------------|
     |                  isin          |            sedol           |
     |          DE000PAH0038          |          7101069           |
     +-------------------------------------------------------------+

     +-------------------------------------------------------------+
  7. |  gvkey |                         conm |  sic |  naics | fic |
     | 284390 |                       HWA AG | 3711 |   3361 | DEU |
     |-------------------------------------------------------------|
     |                  isin          |            sedol           |
     |          DE000A0LR4P1          |          B1W6080           |
     +-------------------------------------------------------------+
  7
---- CHECKPOINT: header data loaded ----
Number of firms: 7
✅ Test passed: Header data loaded correctly with 7 firms.

Task 1.3: Filter Firms for Analysis

What you'll do: Remove firms that are not independent automakers—specifically subsidiaries of the Volkswagen Group and specialized motorsport companies.

Why this matters: For a clean analysis of the Dieselgate event's impact, we want to focus on independent German automakers:

  • Audi AG (gvkey: 101120) is a wholly-owned subsidiary of Volkswagen. Their stock performance doesn't represent an independent market reaction.
  • Porsche Automobil Holding SE (gvkey: 102187) is the controlling shareholder of Volkswagen Group. Their reaction conflates ownership and operational exposure.
  • HWA AG (gvkey: 284390) is primarily a motorsport/tuning company, not a traditional volume automaker.

By removing these firms, we create a cleaner comparison set of independent German automakers: Volkswagen, BMW, Mercedes-Benz, and MAN.

What to expect: After filtering, you should have exactly 4 firms remaining.

Stata Tip
Use drop if gvkey == code to remove specific firms. You can repeat this command for each firm to drop, or use the OR operator: drop if gvkey == code1 | gvkey == code2 | ...
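Using the gvkeys listed above, the tip's one-line variant of the filter looks like this (note that gvkey is numeric, so no quotes; the log below shows the three-command variant instead):

```stata
* Drop Audi, Porsche SE, and HWA by their Compustat gvkeys
drop if gvkey == 101120 | gvkey == 102187 | gvkey == 284390
```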
(1 observation deleted)
(1 observation deleted)
(1 observation deleted)

     +---------------------------------------+
     |  gvkey                           conm |
     |---------------------------------------|
  1. |  17828         MERCEDES BENZ GROUP AG |
  2. | 100022   BAYERISCHE MOTOREN WERKE AKT |
  3. | 100042                         MAN SE |
  4. | 100737                  VOLKSWAGEN AG |
     +---------------------------------------+
---- CHECKPOINT: firm filtering ----
Remaining firms: 4
     +---------------------------------------+
     |  gvkey                           conm |
     |---------------------------------------|
  1. |  17828         MERCEDES BENZ GROUP AG |
  2. | 100022   BAYERISCHE MOTOREN WERKE AKT |
  3. | 100042                         MAN SE |
  4. | 100737                  VOLKSWAGEN AG |
     +---------------------------------------+
✅ Test passed: 4 independent firms retained after filtering.

Task 1.4: Save the Cleaned Header Data

What you'll do: Save the data (with the 4 filtered firms) to a temporary file for merging with the price data in the next section.

Why this matters: We need the header data saved in a format that can be merged with the price data. Otherwise, we would have to repeat the loading and filtering steps every time we want to merge or analyze the data. By saving a cleaned version of the header data, we streamline our workflow for the next sections.

What to expect: The header data will be saved in the "processed" folder, ready for the upcoming merge.

Stata Tip
Save to a temporary .dta file: save "$processed/temp_header.dta", replace
(file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/01-assignment/data/processed/temp_header.dta not found)
file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/01-assignment/data/processed/temp_header.dta saved
---- CHECKPOINT: header data prepared ----
Header data filtered and saved with 4 firms
✅ Test passed: File saved successfully in processed folder.

Section 2: Load and Examine Price Data

Task 2.1: Import Daily Stock Prices

What you'll do: Load the auto_firms_g_secd.csv file containing daily stock prices.

Why this matters: This file contains the time-series data we'll use for our event study analysis. It includes daily closing prices, trading volumes, and adjustment factors for the period 2013-2017.

What to expect: This is a much larger dataset than the header file—you should see thousands of daily observations (approximately 1,300 trading days × 7 firms ≈ 9,100 rows).
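The import mirrors Task 1.1; a sketch using the filename from the task description:

```stata
* Load the daily price file; clear replaces whatever is in memory
import delimited "$raw/auto_firms_g_secd.csv", clear
```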

(encoding automatically selected: ISO-8859-2)
(4 vars, 9,128 obs)

Task 2.2: Examine Price Data

What you'll do: Inspect the price data structure and generate summary statistics.

Why this matters: Understanding the distribution and range of prices helps you spot potential data issues (negative prices, extreme outliers, etc.) before they cause problems in your analysis.

Key variables to examine:

  • gvkey: Firm identifier (should match the header file)
  • datadate: Trading date (currently stored as text)
  • prccd: Closing price in local currency (EUR)
  • ajexdi: Adjustment factor for stock splits and dividends

What to expect: Prices vary widely across firms (from under €15 to several hundred euros). You'll also see the data spans multiple years.

Contains data
 Observations:         9,128                  
    Variables:             4                  
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
gvkey           long    %12.0g                
datadate        str10   %10s                  
prccd           float   %9.0g                 
ajexdi          float   %9.0g                 
-------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       gvkey |      9,128    115189.4    74783.26      17828     284390
    datadate |          0
       prccd |      9,128    162.3696    207.5795     12.975        861
      ajexdi |      9,128    1.026389    .0646436          1   1.184724

     +--------------------------------------+
     | gvkey     datadate    prccd   ajexdi |
     |--------------------------------------|
  1. | 17828   2013-01-01    41.32        1 |
  2. | 17828   2013-01-02     42.6        1 |
  3. | 17828   2013-01-03   42.455        1 |
  4. | 17828   2013-01-04    42.63        1 |
  5. | 17828   2013-01-07    43.06        1 |
     |--------------------------------------|
  6. | 17828   2013-01-08     42.5        1 |
  7. | 17828   2013-01-09   42.705        1 |
  8. | 17828   2013-01-10    42.89        1 |
  9. | 17828   2013-01-11    43.09        1 |
 10. | 17828   2013-01-14   43.655        1 |
     |--------------------------------------|
 11. | 17828   2013-01-15   43.785        1 |
 12. | 17828   2013-01-16   43.715        1 |
 13. | 17828   2013-01-17   43.885        1 |
 14. | 17828   2013-01-18   43.345        1 |
 15. | 17828   2013-01-21   43.195        1 |
     |--------------------------------------|
 16. | 17828   2013-01-22    43.01        1 |
 17. | 17828   2013-01-23   43.155        1 |
 18. | 17828   2013-01-24   43.255        1 |
 19. | 17828   2013-01-25   43.805        1 |
 20. | 17828   2013-01-28    43.95        1 |
     +--------------------------------------+
---- CHECKPOINT: price data loaded ----
Number of observations: 9128
✅ Test passed: Price data loaded with 9128 observations.

Task 2.3: Convert Date Variable

What you'll do: Convert the datadate variable from text format ("2015-09-18") to Stata's internal date format (a number representing days since January 1, 1960).

Why this matters: Stata can only perform date calculations (sorting by time, computing differences, creating lags) if dates are stored in its internal numeric format. Text dates like "2015-09-18" look like dates to humans, but Stata treats them as text strings.

What you'll create: A new variable called date that contains the numeric date value, which you can then use for time-series operations.

What to expect: After conversion, you'll see dates displayed in a readable format (01jan2013, 02jan2013, etc.) even though they're stored internally as numbers.

Stata Tip
  • gen date = date(datadate, "YMD") — converts YYYY-MM-DD text to Stata date
  • format date %td — displays dates in readable format (01jan2013)
  • drop datadate — removes the now-redundant text variable
     +----------------------------+
     | gvkey        date    prccd |
     |----------------------------|
  1. | 17828   01jan2013    41.32 |
  2. | 17828   02jan2013     42.6 |
  3. | 17828   03jan2013   42.455 |
  4. | 17828   04jan2013    42.63 |
  5. | 17828   07jan2013    43.06 |
     |----------------------------|
  6. | 17828   08jan2013     42.5 |
  7. | 17828   09jan2013   42.705 |
  8. | 17828   10jan2013    42.89 |
  9. | 17828   11jan2013    43.09 |
 10. | 17828   14jan2013   43.655 |
     +----------------------------+
---- CHECKPOINT: date conversion ----
Date variable created and formatted
✅ Test passed: Date variable is in proper Stata format.

Section 3: Data Cleaning

Task 3.1: Identify Missing Values (if any)

What you'll do: Check how many observations have missing values in the prccd (closing price) variable.

Why this matters: Missing prices can occur when trading is halted, data wasn't collected, or there were data transmission errors. These missing values will cause problems in calculations (returns, averages, etc.), so we need to identify and handle them.

What to expect: You may find a small number of observations with missing prices. This is normal in financial data.

Stata Tip
  • misstable summarize prccd — shows pattern of missing values
  • count if missing(prccd) — counts how many observations have missing prices
(variables nonmissing or string)
  0
---- CHECKPOINT: missing value check ----
Missing prices: 0

Task 3.2: Remove Missing Prices

What you'll do: Delete all observations where prccd (closing price) is missing.

Why this matters: We cannot calculate returns or perform meaningful analysis on days with missing prices. Rather than imputing (guessing) these values, the standard approach in finance is to remove these observations. This ensures all our subsequent calculations are based on actual observed prices.

What to expect: After dropping missing values, you should have slightly fewer observations than before (the exact number depends on how many missing values were found).

Stata Stata Tip
Use drop if missing(prccd) to remove all observations where the closing price is missing.
(0 observations deleted)
---- CHECKPOINT: data cleaning ----
Observations after removing missing prices: 9128
✅ Test passed: No missing values in price data.

Section 4: Merge the Datasets

Task 4.1: Understanding Stata's Data Model

Before we merge our datasets, we need to understand an important Stata limitation and how to work around it.

ℹ️ Background: Stata's Single-Dataset Model
Unlike some statistical software that can hold multiple datasets in memory simultaneously, Stata can only work with one dataset at a time. This creates a challenge when we need to merge data:

  • We currently have price data in memory
  • We need to merge it with header data (firm names, identifiers)
  • But loading the header data would erase our price data!
Solution: Save the price data to a file, load the header data, save it to another file, then reload the price data and merge.
Stata Tip: Save and Load Workflow
The workflow:
  1. Save the current price data to a temporary file
  2. Load the header data (this replaces price data in memory)
  3. Save the header data to another temporary file
  4. Reload the price data from the first file
  5. Merge with the saved header file
This explicit approach makes it clear what's happening at each step.
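Mapped to commands, the five steps sketch out like this (file names follow the tasks in this section; the filter in step 2's aftermath is elided):

```stata
save "$processed/temp_prices.dta", replace             // 1. save price data
import delimited "$raw/auto_firms_g_names.csv", clear  // 2. load header (replaces prices in memory)
* ... filter the header rows here ...
save "$processed/temp_header.dta", replace             // 3. save header data
use "$processed/temp_prices.dta", clear                // 4. reload price data
merge m:1 gvkey using "$processed/temp_header.dta"     // 5. merge on gvkey
```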

Task 4.2: Save Price Data Temporarily

What you'll do: Save the cleaned price data to a temporary file so we can load the header data without losing our work.

Why this matters: This is step 1 of our merge workflow—preserving the price data we've just cleaned before loading the header file.

What to expect: Stata will save the current data to the processed folder as a temporary file.

Stata Tip
Use save "$processed/temp_prices.dta", replace to save the current dataset. We'll delete this file after the merge is complete.
(file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/01-assignment/data/processed/temp_prices.dta not found)
file /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assignments/01-assignment/data/processed/temp_prices.dta saved
---- CHECKPOINT: price data saved ----
Cleaned price data saved temporarily

Task 4.3: Reload Price Data and Merge

What you'll do: Load the saved price data back into memory and merge it with the header data.

Why this matters: This completes our merge workflow—we now have both datasets ready to combine using the gvkey identifier.

Type of merge: This is a many-to-one (m:1) merge because:

  • Price data (master): Many observations per firm (one for each trading day)
  • Header data (using): One observation per firm
  • Each of the many price observations will match to the one corresponding header observation

What to expect: After merging, each price observation will have the firm's name, SIC code, ISIN, and other identifiers attached to it. You can use tab _merge for an overview of how many observations matched successfully.

Stata Tip
  • 1:1 merge: One observation per identifier in both datasets
  • m:1 merge: Many observations in current data match one in merge file
  • 1:m merge: One observation in current data matches many in merge file
use "$processed/temp_prices.dta", clear
merge m:1 gvkey using "$processed/temp_header.dta"
tab _merge
🚀 Above and Beyond: preserve and restore
An alternative approach uses preserve and restore commands:

preserve takes a snapshot of the current data in Stata's memory, restore brings it back. This is more elegant but less explicit:

preserve
import delimited "$raw/auto_firms_g_names.csv", clear
drop if gvkey == 101120 | gvkey == 102187 | gvkey == 284390
tempfile header
save `header'
restore
merge m:1 gvkey using `header'
The tempfile creates a temporary file that auto-deletes when Stata closes. Both approaches work—the explicit save/load is clearer for beginners, while preserve/restore is more concise for experienced users.
    Result                      Number of obs
    -----------------------------------------
    Not matched                         3,912
        from master                     3,912  (_merge==1)
        from using                          0  (_merge==2)

    Matched                             5,216  (_merge==3)
    -----------------------------------------

   Matching result from |
                  merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
        Master only (1) |      3,912       42.86       42.86
            Matched (3) |      5,216       57.14      100.00
------------------------+-----------------------------------
                  Total |      9,128      100.00
---- CHECKPOINT: merge completed ----
Merge results displayed above; temporary files deleted

Task 4.4: Keep Only Matched Observations

What you'll do: Keep only observations that successfully matched in both datasets (those with _merge == 3).

Why this matters: Stata creates a special variable called _merge to show the merge result for each observation. We only want observations that appear in both datasets—these are the firms for which we have both identifier information AND price data. You can use tab _merge for an overview of how many observations matched successfully, if you have not already done so.

What to expect: All observations should match (_merge == 3) since both datasets contain the same 4 firms. After keeping matched observations and dropping the _merge variable, you should have the same number of observations as before the merge.

ℹ️ Background: Understanding _merge
The _merge variable shows merge results:
  • _merge == 1: Only in master (price data) — would indicate a firm with prices but no identifier info
  • _merge == 2: Only in using (header data) — would indicate a firm with identifiers but no prices
  • _merge == 3: Matched in both datasets — this is what we want!
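A sketch of the two commands this task calls for:

```stata
keep if _merge == 3   // retain only observations present in both datasets
drop _merge           // the indicator is no longer needed after filtering
```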
(3,912 observations deleted)
---- CHECKPOINT: matched observations retained ----
Final observation count: 5216
✅ Test passed: Datasets merged successfully.

Section 5: Declare Panel Structure and Create Variables

Task 5.1: Declare Panel Data Structure with xtset

What you'll do: Tell Stata that this is panel data (multiple firms observed over time) using the xtset command.

Why this matters: Our dataset has a special structure—it tracks the same firms (panel units) across multiple time periods. Once we declare this structure with xtset, Stata unlocks powerful time-series operators that make working with lagged variables, leads, and differences much easier and less error-prone.

What xtset enables:

  • Time-series operators: L.variable (lag), F.variable (lead), D.variable (difference)
  • Panel-aware commands: xtreg, xtdescribe, xtsum
  • Automatic checks: Stata will warn you if data isn't sorted correctly

Panel structure for our data:

  • Panel variable (gvkey): Identifies which firm each observation belongs to
  • Time variable (date): Identifies when each observation occurred

What to expect: Stata will sort the data by gvkey and date, then report information about the panel structure (number of firms, time periods per firm, gaps in the data, etc.).

ℹ️ Background: What is Panel Data?
Panel data (also called longitudinal data) tracks the same entities over time:
  • Cross-sectional data: One observation per entity (4 firms, one time point)
  • Time-series data: One entity observed many times (VW stock, 2013-2017)
  • Panel data: Multiple entities, each observed many times (4 firms × ~1,300 days)
Panel data is powerful because it lets us control for firm-specific characteristics and study changes over time.
Stata Tip: Panel Structure
Syntax: xtset panelvar timevar
  • panelvar: Variable identifying panel units (firms) → gvkey
  • timevar: Variable identifying time periods → date
After xtset, use xtdescribe to examine panel characteristics.
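Concretely, for this dataset:

```stata
xtset gvkey date   // declare panel: firms (gvkey) observed over time (date)
xtdescribe         // inspect panel balance, time span, and gaps
```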
Panel variable: gvkey (strongly balanced)
 Time variable: date, 01jan2013 to 29dec2017, but with gaps
         Delta: 1 day

   gvkey:  17828, 100022, ..., 100737                        n =          4
    date:  01jan2013, 02jan2013, ..., 29dec2017              T =       1304
           Delta(date) = 1 day
           Span(date)  = 1824 periods
           (gvkey*date uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                      1304    1304    1304      1304      1304    1304    1304

     Freq.  Percent    Cum. |  Pattern*
 ---------------------------+------------------------------------------------------------------------------------------------------
        4    100.00  100.00 |  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX5
 ---------------------------+------------------------------------------------------------------------------------------------------
        4    100.00         |  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 -----------------------------------------------------------------------------------------------------------------------------------
 *Each column represents 18 periods.

---- CHECKPOINT: panel structure declared ----
Panel data structure set with gvkey as panel variable and date as time variable
✅ Test passed: Panel structure correctly declared.

Task 5.2: Generate Daily Returns Using L. Operator

What you'll do: Calculate daily log returns for each firm using Stata's lag operator.

Why log returns: Returns measure the percentage change in stock prices from one day to the next. We use log returns rather than simple percentage returns because they have better statistical properties:

  • Symmetry: A 10% gain followed by a 10% loss doesn't return you to the starting point with simple returns, but log returns are symmetric
  • Time-additivity: Log returns over multiple periods sum correctly
  • Approximate normality: Daily log returns are closer to normally distributed

The formula: $$r_t = \ln(P_t / P_{t-1}) = \ln(P_t) - \ln(P_{t-1})$$

where $P_t$ is today's closing price and $P_{t-1}$ is yesterday's closing price.
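The symmetry claim can be checked with quick display calculations (a sketch): a 10% gain followed by a 10% loss takes the price from 100 to 110 to 99, so simple returns sum to zero even though value was lost, while log returns sum to the true round-trip return.

```stata
* +10% then -10%: price goes 100 -> 110 -> 99, not back to 100
display (110/100 - 1) + (99/110 - 1)   // simple returns sum to 0, yet 1% of value is lost
display ln(110/100) + ln(99/110)       // log returns sum to ln(99/100), about -0.01
* Symmetry: the log return of a move equals minus the log return of its reversal
display ln(110/100) + ln(100/110)      // cancels (up to rounding)
```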

Using the L. operator: Because we declared the panel structure with xtset, we can use L.prccd to automatically reference yesterday's price for the same firm. Stata handles the complexity of ensuring we don't accidentally use another firm's price.

What to expect:

  • A new variable ret will be created with daily log returns
  • The first observation for each firm will have a missing return (no previous price to compare to)
  • Most returns will be small numbers close to zero (e.g., -0.02 to +0.02 for ±2%)
  • You may see some larger returns around major events
Stata Tip: Lag Operator
Use: gen ret = ln(prccd/L.prccd)
  • L.prccd automatically references the previous period's price for the same firm
  • No need for sorting or conditional statements—xtset handles this
  • Other operators: F. (forward/lead), D. (difference), L2. (2-period lag)
🚀 Above and Beyond
You might see alternative code that calculates returns using gen ret = ln(prccd/prccd[_n-1]) if gvkey==gvkey[_n-1]. This approach manually references the previous observation using [_n-1] and checks that we're still in the same firm with the conditional statement.

Both methods produce the same result, but the L. operator is preferred because:
  • Less error-prone (no need to remember the conditional)
  • More readable and easier to maintain
  • Works correctly even with gaps in the data
  • Standard approach in finance research
(1,044 missing values generated)

                             ret
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -.0398459      -.1842681
 5%    -.0236866      -.0857544
10%    -.0152512      -.0782709       Obs               4,172
25%    -.0057539      -.0727594       Sum of wgt.       4,172

50%            0                      Mean           .0003003
                        Largest       Std. dev.      .0144726
75%     .0062713       .0669525
90%     .0161934       .0703134       Variance       .0002095
95%     .0248292       .0721684       Skewness      -.4896224
99%     .0403118       .0801751       Kurtosis       12.09002
---- CHECKPOINT: returns created ----
Returns variable created successfully using L. operator
✅ Test passed: Returns variable created.
✅ Test passed: Returns are in reasonable range for daily data.

Task 5.3: Create Year and Month Variables

What you'll do: Extract year and month components from the date variable.

Why this matters: Year and month variables allow you to:

  • Group observations by time periods (e.g., calculate monthly average returns)
  • Create time-based subsets (e.g., analyze only 2015 data)
  • Generate time-based summary statistics
  • Create time fixed effects in regressions (later assignments)

What to expect: Two new integer variables:

  • year: Will contain values like 2013, 2014, 2015, 2016, 2017
  • month: Will contain values from 1 (January) to 12 (December)
Stata Tip
Use Stata's date extraction functions:
  • gen year = year(date) — extracts the year
  • gen month = month(date) — extracts the month
  • Also available: day(), dow() (day of week), quarter()
       year |      Freq.     Percent        Cum.
------------+-----------------------------------
       2013 |      1,044       20.02       20.02
       2014 |      1,044       20.02       40.03
       2015 |      1,044       20.02       60.05
       2016 |      1,044       20.02       80.06
       2017 |      1,040       19.94      100.00
------------+-----------------------------------
      Total |      5,216      100.00
---- CHECKPOINT: time variables created ----
Year and month variables created
✅ Test passed: Year variable created.
✅ Test passed: Month variable created.
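A minimal sketch of the extraction step (assuming `date` is a proper Stata daily date in `%td` format):

```stata
* Extract calendar components from the Stata daily date
gen year  = year(date)
gen month = month(date)

* Quick sanity check on the sample window (should span 2013-2017)
tab year
```

The same pattern works with `quarter(date)` or `dow(date)` if you later need quarterly or day-of-week groupings.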

Task 5.4: Label Variables¶

What you'll do: Add descriptive labels to variables to document what they contain.

Why this matters: Variable labels make your dataset self-documenting. When you (or someone else) opens this file in 6 months, the labels will explain what each variable represents without needing to consult external documentation.

Variables to label (at minimum):

  • gvkey: Compustat company identifier
  • conm: Company name
  • date: Trading date
  • prccd: Closing price (local currency)
  • ret: Daily log return
  • year: Year
  • month: Month

What to expect: When you run describe, you'll see your descriptive labels next to each variable name.

Stata Tip
Use label variable varname "Descriptive label here" to add labels. Keep labels concise but informative.
Contains data from /Users/casparm4/Github/rsm-data-analytics-in-finance-private
> /private/assignments/01-assignment/data/processed/temp_prices.dta
 Observations:         5,216                  
    Variables:            13                  10 Feb 2026 09:26
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
gvkey           long    %12.0g                Compustat company identifier
prccd           float   %9.0g                 Closing price (local currency)
ajexdi          float   %9.0g                 
date            float   %td                   Trading date
conm            str28   %28s                  Company name
sic             int     %8.0g                 
naics           long    %12.0g                
fic             str3    %9s                   
isin            str12   %12s                  
sedol           str7    %9s                   
ret             float   %9.0g                 Daily log return
year            float   %9.0g                 Year
month           float   %9.0g                 Month
-------------------------------------------------------------------------------
Sorted by: gvkey  date
     Note: Dataset has changed since last saved.
---- CHECKPOINT: variables labeled ----
Variable labels applied
✅ Test passed: Variables are properly labeled.
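The labeling step itself is a short block of `label variable` statements — one per variable, matching the labels shown in the `describe` output above:

```stata
* Descriptive labels make the dataset self-documenting
label variable gvkey "Compustat company identifier"
label variable conm  "Company name"
label variable date  "Trading date"
label variable prccd "Closing price (local currency)"
label variable ret   "Daily log return"
label variable year  "Year"
label variable month "Month"

* Verify: labels appear in the rightmost column
describe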

Section 6: Summary Statistics¶

Task 6.1: Overall Summary Statistics¶

What you'll do: Generate comprehensive summary statistics for the key variables prccd (price) and ret (returns).

Why this matters: Summary statistics help you:

  • Understand the distribution of your data (center, spread, shape)
  • Identify potential outliers or data errors
  • Get a sense of the typical magnitudes you're working with
  • Spot asymmetries or unusual patterns

What to expect:

  • Prices: Wide variation (roughly €40 to €250), positive values only
  • Returns: Centered near zero, small standard deviation (~0.01-0.02), some extreme values during volatile periods
  • Percentiles showing the distribution across the entire sample
Stata Tip
Use summarize prccd ret, detail for comprehensive statistics including percentiles, skewness, and kurtosis.
               Closing price (local currency)
-------------------------------------------------------------
      Percentiles      Smallest
 1%       43.625         38.645
 5%        58.83         38.725
10%         63.1          38.84       Obs               5,216
25%        73.97         39.435       Sum of wgt.       5,216

50%           90                      Mean           100.8507
                        Largest       Std. dev.      39.92719
75%       115.45          242.3
90%       168.65          244.8       Variance       1594.181
95%        185.8          247.5       Skewness       1.286392
99%       224.85         247.55       Kurtosis       4.045226

                      Daily log return
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -.0398459      -.1842681
 5%    -.0236866      -.0857544
10%    -.0152512      -.0782709       Obs               4,172
25%    -.0057539      -.0727594       Sum of wgt.       4,172

50%            0                      Mean           .0003003
                        Largest       Std. dev.      .0144726
75%     .0062713       .0669525
90%     .0161934       .0703134       Variance       .0002095
95%     .0248292       .0721684       Skewness      -.4896224
99%     .0403118       .0801751       Kurtosis       12.09002
---- CHECKPOINT: summary statistics ----
Overall summary statistics displayed
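Beyond printing the table, `summarize, detail` stores its results in `r()`, which you can reuse for follow-up checks. A sketch (the `extreme` flag is a hypothetical example, not part of the assignment):

```stata
summarize ret, detail

* Returned results are available after the command runs
display "Median return: " r(p50)
display "Kurtosis:      " r(kurtosis)

* Save a returned result to a local before later commands overwrite it
local sd = r(sd)

* Example: flag days with returns beyond three standard deviations
gen extreme = (abs(ret) > 3*`sd') if !missing(ret)
```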

Task 6.2: Summary Statistics by Firm¶

What you'll do: Create a table showing statistics for prccd and ret separately for each company.

Why this matters: Firm-level statistics reveal:

  • Which firms have higher/lower average prices
  • Which firms are more volatile (higher standard deviation of returns)
  • Differences in trading patterns across firms
  • Whether all firms have similar numbers of observations

What to expect:

  • Each firm will have ~1,300 observations (trading days)
  • Price levels will vary considerably across firms
  • Return volatility (standard deviation) will be broadly similar for most firms (~0.01–0.02), though MAN SE is noticeably less volatile (~0.005)
  • You might notice Volkswagen has unusual patterns around September 2015 (Dieselgate)
Stata Tip
Use tabstat prccd ret, by(conm) stat(n mean sd min max) to create grouped summary tables.
  • by(conm) groups by company name
  • stat() specifies which statistics to show: n (count), mean, sd (standard deviation), min, max
Summary statistics: N, Mean, SD, Min, Max
Group variable: conm (Company name)

            conm |     prccd       ret
-----------------+--------------------
BAYERISCHE MOTOR |      1304      1043
                 |  84.62069   .000237
                 |  10.50599  .0155601
                 |     63.93 -.0782709
                 |     122.6  .0703134
-----------------+--------------------
          MAN SE |      1304      1043
                 |  92.20431  .0002268
                 |  3.441024  .0047124
                 |     80.75 -.0352965
                 |     99.02  .0340444
-----------------+--------------------
MERCEDES BENZ GR |      1304      1043
                 |  65.24918  .0005072
                 |  10.82731  .0156527
                 |    38.645 -.0857544
                 |     95.79   .059825
-----------------+--------------------
   VOLKSWAGEN AG |      1304      1043
                 |  161.3286  .0002301
                 |  29.55823  .0181396
                 |    101.15 -.1842681
                 |    247.55  .0801751
-----------------+--------------------
           Total |      5216      4172
                 |  100.8507  .0003003
                 |  39.92719  .0144726
                 |    38.645 -.1842681
                 |    247.55  .0801751
--------------------------------------
---- CHECKPOINT: by-firm statistics ----
Firm-level summary statistics displayed
✅ Test passed: Summary statistics generated successfully.
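The grouped table above comes from a single `tabstat` call; a sketch, including an optional wider layout (the `columns()` and `format()` options are extras, not required by the assignment):

```stata
* Grouped summary table: one block per firm, statistics stacked in rows
tabstat prccd ret, by(conm) statistics(n mean sd min max)

* Alternative layout: statistics as columns, fixed display format
tabstat ret, by(conm) statistics(n mean sd min max) ///
    columns(statistics) format(%9.4f)
```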

Section 7: Save Processed Data¶

Task 7.1: Order Variables and Sort Data¶

What you'll do: Arrange the columns (variables) in a logical order and ensure rows are properly sorted.

Why this matters:

  • Column ordering: Putting important variables first (identifiers, then key variables, then derived variables) makes the dataset easier to browse and understand
  • Row sorting: Proper sorting (by firm, then by date) ensures:
    • Easier visual inspection
    • Correct functioning of time-series commands
    • Faster data access in some operations

Logical variable order:

  1. Identifiers: gvkey, conm
  2. Time variables: date, year, month
  3. Core variables: prccd, ret
  4. Other variables (if any)

What to expect: After ordering, when you list or browse the data, you'll see variables in a sensible sequence. The data will be sorted by firm (gvkey), then chronologically within each firm.

Stata Tip
  • order var1 var2 var3 — moves specified variables to the leftmost positions in order
  • sort var1 var2 — sorts rows by var1, then by var2 within each var1 value
Note: xtset already sorted your data, but it's good practice to explicitly sort before saving.
---- CHECKPOINT: data organized ----
Dataset ordered and sorted
✅ Test passed: Data is properly sorted.
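A minimal sketch of the ordering and sorting step, following the logical order listed above:

```stata
* Identifiers first, then time variables, then core variables;
* unlisted variables keep their relative order after these
order gvkey conm date year month prccd ret

* Sort rows by firm, then chronologically within each firm
sort gvkey date
```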

Task 7.2: Save as Stata Dataset¶

What you'll do: Save the cleaned, merged, and processed dataset in Stata's native .dta format.

Why this matters:

  • Preservation of work: All your data cleaning, variable creation, and labeling will be saved
  • Efficiency: .dta files load much faster than CSV files and preserve variable types, labels, and formats
  • Future assignments: This cleaned dataset will be the starting point for Assignment 2 (Data Visualization)

What you're saving:

  • 4 German automaker firms
  • ~5,200 daily observations (2013-2017)
  • Merged identifier and price data
  • Created variables (returns, year, month)
  • Panel structure declaration (xtset)
  • Variable labels and documentation

Where it's saved: data/processed/auto_firms_g_daily_clean.dta

What to expect: Stata will confirm the file was saved and show the number of observations and variables.

Stata Tip
Use save "$processed/filename.dta", replace to save the dataset.
  • The replace option allows overwriting if the file already exists
  • Using the $processed global keeps your file organization clean
file
    /Users/casparm4/Github/rsm-data-analytics-in-finance-private/private/assi
    > gnments/01-assignment/data/processed/auto_firms_g_daily_clean.dta saved
---- CHECKPOINT: data saved ----
Dataset saved to: /Users/casparm4/Github/rsm-data-analytics-in-finance-private/
> private/assignments/01-assignment/data/processed/auto_firms_g_daily_clean.dta
Final dataset contains 5216 observations and 13 variables
✅ Test passed: File saved successfully in processed folder.
✅ Test passed: Final dataset has 5216 observations (expected range).
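The save step itself is short; a sketch using the `$processed` global defined in the setup (the `compress` line is optional good practice, not required by the assignment):

```stata
* Optionally shrink storage types before saving (no loss of precision)
compress

* Save to the processed-data folder; replace overwrites an existing file
save "$processed/auto_firms_g_daily_clean.dta", replace
```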

Summary¶

Congratulations! You have successfully:

✅ Loaded company identifier and price data from CSV files
✅ Converted date strings to proper Stata date format
✅ Cleaned the data by removing missing values
✅ Merged two datasets using company identifiers (m:1 merge)
✅ Declared panel data structure with xtset
✅ Created new variables for returns (using L. operator), year, and month
✅ Generated summary statistics overall and by firm
✅ Saved a clean, analysis-ready dataset

Key Concepts Learned¶

Data Management:

  • CSV import and Stata's single-dataset model
  • How to handle multiple datasets
    • Above and beyond: preserve/restore for temporary data operations
  • Many-to-one (m:1) merging
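
The preserve/restore pattern mentioned above can be sketched as follows (the `collapse` to firm-level means is just an illustrative temporary operation):

```stata
preserve
    * Temporary manipulation: collapse daily data to firm-level means
    collapse (mean) prccd ret, by(gvkey conm)
    list
restore
* The original daily dataset is back in memory, unchanged
```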

Panel Data:

  • What panel data is (firms × time)
  • Declaring panel structure with xtset
  • Using time-series operators (L., F., D.)

Financial Variables:

  • Log returns and why we use them
  • Date handling in Stata
  • Importance of data cleaning and validation

The processed dataset (auto_firms_g_daily_clean.dta) will be used in Assignment 2 for data visualization and in later assignments for regression analysis and event studies.


References¶

  • Stata Documentation
  • Stata Data Management Reference Manual
  • Stata Time-Series Reference Manual
  • Jupyter Notebooks
  • EPA Notice of Violation (2015)

Data Analytics for Finance

BM17FI · Academic Year 2025–26

Erasmus University Rotterdam

Created by: Caspar David Peter

© 2026 Rotterdam School of Management