Jenny’s relief investigations and firm website content

Important: Time for some action!

Now it’s time to put our money where our mouth is. Below I import the relief investigation data from the earlier Jones-esque analysis and the cleaned firm website text data from Haans and Mertens (2024), then run a quick-and-dirty analysis to see whether there is any action in the data.

Quick check if there is some action in the data

This can be a make-or-break moment for an empirical project. If there is no action in the data, i.e., no change in outcomes around the event of interest, there is little point in continuing the project. Still, this is a quick-and-dirty check, and more careful analysis is needed to draw any conclusions. Below I focus on treated firms only and plot the average share of injury-related words in their website text over event time. If nothing is happening here, it is unlikely, though not impossible, that a more careful analysis will find anything either.

Simple Bag-of-Words Analysis

There are far more sophisticated text analysis methods out there, but for a quick-and-dirty check, a simple bag-of-words approach should do. Below I count the occurrence of a set of (opinionated and AI-assisted) injury-related words in the website text of treated firms over time.
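
To make the counting idea concrete, here is a minimal base-R sketch on three made-up sentences. The toy texts and the shortened word list are assumptions for illustration only; they are not taken from the website data.

```r
# Toy texts (made up for illustration; not from the website data)
texts <- c(
  "Foreign imports hurt our margins. Imports keep rising.",
  "We opened a new plant in Ohio.",
  "Unfair DUMPING is a threat to the industry."
)

# Shortened, illustrative word list
toy_words <- c("foreign", "imports", "hurt", "unfair", "dumping", "threat")
pattern <- paste(toy_words, collapse = "|")

# gregexpr() finds every match in each text; regmatches() extracts them
matches <- regmatches(texts, gregexpr(pattern, texts, ignore.case = TRUE))
injury_count <- lengths(matches)

# Crude word count: number of whitespace-separated tokens
word_count <- lengths(strsplit(texts, "\\s+"))

toy <- data.frame(injury_count, word_count, ratio = injury_count / word_count)
print(toy)
```

Matching is case-insensitive and substring-based, so "Imports" and "DUMPING" both count. Whether substring matching or whole-word matching is the better choice depends on the word list.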

Code
# Bag of words
library(tidyverse) # dplyr, stringr, tidyr, ggplot2

injury_words <- c("competition", "foreign", "imports", "challenge", "pressure", "difficult", "hurt", "injury", "loss", "unfair", "dumping", "subsidy", "threat")

# One case-insensitive pattern that matches any of the injury-related words
injury_pattern <- regex(paste0(injury_words, collapse = "|"), ignore_case = TRUE)

treated_text_data_firm <- treated_text_data %>%
  # Count every occurrence per page; str_extract() would return only the first match
  mutate(injury_words_count = str_count(text, injury_pattern)) %>%
  group_by(gvkey, year) %>%
  summarize(
    # Sum the matches; n() would count pages rather than injury-related words
    total_injury_words = sum(injury_words_count, na.rm = TRUE),
    total_words = sum(word_count),
    .groups = "drop"
  ) %>%
  mutate(
    injury_word_ratio = total_injury_words / total_words,
    year = as.integer(year)
  )

# view(treated_text_data_firm)
# treated_text_data  %>% .[11123, ]  %>% pull(text)  %>% str_extract(., paste0(injury_words, collapse = "|"))
website_cohorts <- tbl(duckdb_conn, "em_cohorts") %>%
  collect() %>%
  # read_csv(here::here("data-prep", "em_cohorts.csv")) %>%
  # No adjustment of time/fiscal year here - calendar year and website data year should line up
  mutate(year = lubridate::year(datadate)) %>%
  right_join(treated_text_data_firm, by = c("gvkey", "year")) %>%
  filter(!is.na(time))
Warning

#1: The analysis only covers the following event-years: 1995, 1996, 1999, 2000, 2021.

#2: This is a quick-and-dirty analysis. Results should be interpreted with caution and not taken as definitive evidence of any trends or patterns.

Moment of Truth: Is there any action in the data—for treated firms?

Code
# Plot the average injury-word ratio over event time (treated firms only)

website_cohorts %>%
  group_by(time) %>%
  summarize(
    avg_injury_word_ratio = mean(injury_word_ratio, na.rm = TRUE),
    sd_injury_word_ratio = sd(injury_word_ratio, na.rm = TRUE),
    # Count only non-missing ratios so n matches the mean and sd above
    n = sum(!is.na(injury_word_ratio)),
    se_injury_word_ratio = sd_injury_word_ratio / sqrt(n),
    # 95% confidence interval based on the t distribution
    ci_lower = avg_injury_word_ratio - qt(0.975, n - 1) * se_injury_word_ratio,
    ci_upper = avg_injury_word_ratio + qt(0.975, n - 1) * se_injury_word_ratio
  ) %>%
  ggplot(aes(x = time, y = avg_injury_word_ratio)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = 0.2) +
  # Zoom with coord_cartesian(): limits set in scale_y_continuous() would
  # discard ribbon values outside the range instead of only clipping the view
  coord_cartesian(ylim = c(-0.035, 0.035)) +
  scale_x_continuous(breaks = seq(-3, 3, by = 1)) +
  labs(
    title = "Average Injury-Related Word Ratio Over Event Time",
    x = "Event Time (Years Since Investigation)",
    y = "Average Injury-Related Word Ratio"
  ) +
  ggthemes::theme_hc()
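
The confidence band in the plot is a plain t-based interval around the mean, computed separately for each event-time period. A minimal base-R sketch of that calculation follows; the helper name `mean_ci` and the example values are hypothetical.

```r
# t-based confidence interval for a mean, ignoring missing values.
# `mean_ci` is a hypothetical helper name used only for this illustration.
mean_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  m <- mean(x)
  se <- sd(x) / sqrt(n)
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)
  c(mean = m, lower = m - t_crit * se, upper = m + t_crit * se, n = n)
}

# Made-up ratios for one event-time period
ratios <- c(0.010, 0.012, NA, 0.008, 0.015, 0.011)
mean_ci(ratios)
```

This interval treats firm-year observations as independent. Since the same firms appear in several event-time periods, a more careful analysis would probably cluster by firm.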

Simple correlation test between Jones residuals and injury-related word ratio: 0.06.
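
A correlation test of this kind can be sketched with base R's `cor.test()`. The data below are simulated, and the column name `jones_residual` is an assumption, since the merged data set is not shown in this section.

```r
set.seed(42)
n <- 200

# Simulated firm-year data; `jones_residual` is an assumed column name
toy <- data.frame(jones_residual = rnorm(n))
toy$injury_word_ratio <- 0.01 + 0.001 * toy$jones_residual + rnorm(n, sd = 0.01)

# Pearson correlation with a test of the null hypothesis of zero correlation
ct <- cor.test(toy$jones_residual, toy$injury_word_ratio, method = "pearson")
ct$estimate  # correlation coefficient
ct$p.value   # p-value of the test
ct$conf.int  # 95% confidence interval for the correlation
```

Reporting the p-value or confidence interval alongside the coefficient helps readers judge whether a small correlation is distinguishable from zero.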

What about treated vs. control firms?

To take the analysis a step further, we can compare treated firms to control firms using a difference-in-differences (DiD) approach. Below I present summary statistics for key variables, followed by DiD estimates and visualizations of the results.
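
For readers new to DiD, here is a minimal two-group, two-period sketch on simulated data. Everything in it is an assumption for illustration: the panel is made up, `post` is defined as event time at or after zero, and the true effect is set by hand. It is not the specification estimated in this post.

```r
set.seed(1)
n_firms <- 200
event_time <- -3:3

# Simulated balanced panel: first half of firms treated, second half control
panel <- expand.grid(firm = seq_len(n_firms), time = event_time)
panel$treated <- as.integer(panel$firm <= n_firms / 2)
panel$post <- as.integer(panel$time >= 0)  # assumed definition of the post period

true_effect <- 0.002  # hand-picked effect for the simulation
panel$injury_word_ratio <- 0.010 +
  0.001 * panel$treated +
  0.0005 * panel$post +
  true_effect * panel$treated * panel$post +
  rnorm(nrow(panel), sd = 0.002)

# The DiD estimate is the coefficient on the treated x post interaction
did_fit <- lm(injury_word_ratio ~ treated * post, data = panel)
did_estimate <- coef(did_fit)[["treated:post"]]

# Same number by hand: (treated post - pre) minus (control post - pre)
cell <- with(panel, tapply(injury_word_ratio, list(treated, post), mean))
did_by_hand <- (cell["1", "1"] - cell["1", "0"]) - (cell["0", "1"] - cell["0", "0"])

c(regression = did_estimate, by_hand = did_by_hand)
```

The standard errors from this plain `lm()` fit ignore that each firm is observed repeatedly. Clustering by firm would be the usual next step.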

Moment of Truth II: Is there any action in the data—for treated AND control firms?

Code
# Plot the average injury-word ratio over event time, treated vs. control firms

website_cohorts %>%
  # Label the groups before grouping so the grouping column is not modified afterwards
  mutate(treated = ifelse(treated == 1, "Treated", "Control")) %>%
  group_by(treated, time) %>%
  summarize(avg_injury_word_ratio = mean(injury_word_ratio, na.rm = TRUE), .groups = "drop") %>%
  # sd_injury_word_ratio = sd(injury_word_ratio, na.rm = TRUE),
  # n = n(),
  # se_injury_word_ratio = sd_injury_word_ratio / sqrt(n),
  # ci_lower = avg_injury_word_ratio - qt(0.975, n - 1) * se_injury_word_ratio,
  # ci_upper = avg_injury_word_ratio + qt(0.975, n - 1) * se_injury_word_ratio)  %>%
  ggplot(aes(x = time, y = avg_injury_word_ratio, color = treated)) +
  geom_line() +
  geom_point() +
  # geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = 0.2) +
  # Zoom with coord_cartesian() so no points are dropped outside the y range
  coord_cartesian(ylim = c(-0.025, 0.025)) +
  scale_x_continuous(breaks = seq(-3, 3, by = 1)) +
  labs(
    title = "Average Injury-Related Word Ratio Over Event Time",
    x = "Event Time (Years Since Investigation)",
    y = "Average Injury-Related Word Ratio"
  ) +
  ggthemes::theme_hc()

DiD Estimation of Jones Residuals Around Import Relief Investigations

References

Haans, R.F., Mertens, M.J., 2024. The internet never forgets: A four-step scraping tutorial, codebase, and database for longitudinal organizational website data. Organizational Research Methods 10944281241284941.