The Internet Never Forgets - Research Summary
Paper Summary
Research Question
The paper investigates how longitudinal website data can be systematically harvested and leveraged to study organizational discourse and strategy. It addresses a clear gap: despite the abundance of archived webpages, researchers have lacked a reliable, scalable pipeline and a comprehensive, open‑access database for conducting large‑scale, time‑series analyses of firm communication.
Contribution
The authors deliver an open‑source Python codebase that implements a four‑step scraping workflow—sample construction, Wayback Machine retrieval, preprocessing, and analysis—yielding the CompuCrawl database. This resource contains 11,277 North American firms, 86,303 firm‑year observations, and 1.6 million webpages (1996‑2020). The paper also demonstrates the database’s utility through a word‑embedding study of the evolving meaning of “sustainability,” and outlines future research avenues enabled by multimodal, historical web data.
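The retrieval step draws on the Internet Archive's Wayback Machine, whose public CDX API lists archived snapshots of a URL. The endpoint and parameter names below are the CDX API's own; the helper function, the example domain, and the one-snapshot-per-year filtering are illustrative simplifications, not the authors' code:

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def build_cdx_query(site: str, from_year: int, to_year: int) -> str:
    """Build a CDX API query for archived snapshots of a site's front page.

    The parameter names (url, from, to, output, filter, collapse) are the
    Wayback Machine CDX API's; collapsing to one capture per calendar year
    mirrors the firm-year unit of analysis but is an illustrative choice.
    """
    params = {
        "url": site,
        "from": str(from_year),
        "to": str(to_year),
        "output": "json",
        "filter": "statuscode:200",   # keep only successfully archived pages
        "collapse": "timestamp:4",    # at most one capture per calendar year
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

# Example: snapshots of a (hypothetical) firm homepage over 1996-2020
query = build_cdx_query("example.com", 1996, 2020)
```

Fetching this URL returns a JSON list of captures (timestamp, original URL, status code), from which the archived page itself can then be downloaded.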
Theory
The study is grounded in data‑source theory (Landers et al., 2016), which guides the assessment of archival data suitability for research. It also draws on conceptual frameworks of organizational identity, distinctiveness, and strategic group formation, operationalized through topic modeling and word‑embedding techniques that capture semantic shifts over time.
Economic Mechanism
Website content is treated as a proxy for organizational discourse; changes in the language and structure of webpages reflect shifts in strategic priorities and stakeholder communication. The causal logic posits that as firms evolve their public messaging—e.g., increasing emphasis on sustainability—their website text will exhibit measurable changes in word semantics and topic prevalence, which can be quantified through embeddings and topic models.
Research Design
The empirical approach combines archival data collection with advanced text analytics. Firm identifiers are drawn from Compustat North America; archived webpages are retrieved from the Wayback Machine for 1996‑2020. After rigorous cleaning (removal of invalid, non‑English, and short pages) and classification via GPT‑3.5, the authors aggregate front‑page and sub‑page texts into firm‑year observations. They then apply a correlated topic model (125 topics) and continuous bag‑of‑words word‑embedding models, aligning embeddings across five‑year periods with orthogonal Procrustes to track semantic evolution.
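The alignment step described above, mapping embeddings trained on different periods into a shared space with an orthogonal Procrustes transformation, can be sketched in a few lines of NumPy. The matrix names and toy dimensions are illustrative; in the paper the inputs are trained CBOW embeddings with rows indexed by a shared vocabulary:

```python
import numpy as np

def procrustes_align(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Rotate `source` embeddings onto `target` with an orthogonal map.

    Solves min_R ||source @ R - target||_F subject to R^T R = I via the
    classical SVD solution to the orthogonal Procrustes problem. Both
    matrices must share the same shape (vocabulary size, dimensions),
    with rows aligned to the same vocabulary.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    rotation = u @ vt                 # optimal orthogonal matrix R
    return source @ rotation

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sanity check: aligning an arbitrarily rotated copy recovers the original
rng = np.random.default_rng(0)
target = rng.standard_normal((50, 8))            # toy "period 2" embeddings
q, _ = np.linalg.qr(rng.standard_normal((8, 8))) # random orthogonal matrix
source = target @ q                              # toy "period 1": rotated copy
aligned = procrustes_align(source, target)
```

After alignment, cosine similarity between a word's vectors from different periods becomes a meaningful measure of semantic stability, which is how the shift in "sustainability" is tracked.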
Results
The CompuCrawl database contains 1,617,675 webpages and 636,318,656 words across 86,303 firm‑year observations. The word‑embedding analysis shows that the meaning of "sustainability" shifts markedly over time: its cosine similarity ranges from 0.301 in 2001‑2005 to 0.916 in 2016‑2020, whereas "profitability" remains semantically stable, with similarity above 0.74 in every period. Topic modeling yields 125 coherent topics, and GPT‑3.5 page classification reaches a Krippendorff's α of 0.79 against human coders. A coverage analysis indicates that missingness due to non‑archived websites is largely random and does not bias the sample.
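The α of 0.79 reported for the GPT‑3.5 classifications is Krippendorff's alpha, a chance-corrected agreement coefficient. A minimal sketch of the nominal-data case for two coders with no missing values follows; this is an illustration of the metric, not the authors' implementation:

```python
from collections import Counter

def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for two coders, nominal data, no missing values.

    alpha = 1 - D_o / D_e, where D_o sums observed disagreements within
    units and D_e is the disagreement expected from the pooled value
    distribution (with the standard n - 1 correction).
    """
    assert len(coder_a) == len(coder_b)
    values = list(coder_a) + list(coder_b)
    n = len(values)                      # total number of pairable values
    counts = Counter(values)             # pooled frequency of each category
    # Each unit holds m = 2 values, so a disagreeing unit contributes
    # 2 / (m - 1) = 2 ordered disagreeing pairs to D_o.
    d_o = sum(2 for a, b in zip(coder_a, coder_b) if a != b)
    # Expected disagreement from the pooled marginal frequencies.
    d_e = sum(counts[c] * counts[k]
              for c in counts for k in counts if c != k) / (n - 1)
    return 1.0 - d_o / d_e

# Toy example: two coders agreeing on 3 of 4 binary labels
alpha = krippendorff_alpha_nominal([1, 1, 0, 1], [1, 1, 0, 0])
```

Values near 1 indicate near-perfect agreement beyond chance; by the common ≥ 0.667 benchmark, the reported 0.79 supports using the GPT‑3.5 labels at scale.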
Conclusion
Longitudinal website data, when systematically harvested and processed, offers a rich, underexploited source for organizational research. The CompuCrawl database and accompanying codebase enable scholars to examine temporal changes in corporate discourse, identity, and strategic positioning at scale. The paper demonstrates that word‑embedding and topic‑modeling techniques can uncover meaningful semantic shifts, and it highlights future opportunities for multimodal analysis, generative‑AI classification, and comparative studies of strategic groups and distinctiveness.
Generated by Paper Summarizer using Ollama (Model: gpt-oss:20b)