Can LLMs Credibly Transform the Creation of Panel Data from Diverse Historical Tables?
By: Verónica Bäcker-Peral, Vitaly Meursault, Christopher Severen
Potential Business Impact:
Turns old paper records into useful computer data.
Multimodal LLMs offer a watershed change for the digitization of historical tables, enabling low-cost processing centered on domain expertise rather than technical skills. We rigorously validate an LLM-based pipeline on a new panel of historical county-level vehicle registrations. This pipeline is 100 times less expensive than outsourcing, reduces critical parsing errors from 40% to 0.3%, and matches human-validated gold standard data with an $R^2$ of 98.6%. Analyses of growth and persistence in vehicle adoption are statistically indistinguishable whether using LLM or gold standard data. LLM-based digitization unlocks complex historical tables, enabling new economic analyses and broader researcher participation.
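As a rough illustration of how agreement between LLM-digitized values and gold standard values might be summarized, the sketch below computes an $R^2$ as the squared Pearson correlation between two matched series. This is an assumption: the abstract does not say which definition of $R^2$ the authors use, and the function name `r_squared` and the sample numbers are hypothetical, not taken from the paper.

```python
def r_squared(gold, llm):
    """Squared Pearson correlation between two matched, equal-length series.

    Assumption: R^2 is taken as the squared correlation; the paper may
    define it differently (for example, from a regression fit).
    """
    if len(gold) != len(llm) or len(gold) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(gold)
    mean_gold = sum(gold) / n
    mean_llm = sum(llm) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((g - mean_gold) * (x - mean_llm) for g, x in zip(gold, llm))
    var_gold = sum((g - mean_gold) ** 2 for g in gold)
    var_llm = sum((x - mean_llm) ** 2 for x in llm)
    if var_gold == 0 or var_llm == 0:
        raise ValueError("R^2 is undefined when a series is constant")
    return (cov * cov) / (var_gold * var_llm)


# Made-up county registration counts, for illustration only.
gold_counts = [100, 250, 400, 900]
llm_counts = [100, 250, 480, 900]  # one cell misread by the pipeline
print(r_squared(gold_counts, llm_counts))
```

A value close to 1 indicates that the digitized series tracks the gold standard closely; a single misread cell, as in the made-up data above, pulls it slightly below 1.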
Similar Papers
Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918)
General Economics
Computers quickly read old, messy documents for history.
Multimodal LLMs for OCR, OCR Post-Correction, and Named Entity Recognition in Historical Documents
Computation and Language
Reads old German books better than ever before.
Leveraging Large Language Models to Democratize Access to Costly Datasets for Academic Research
General Finance
Lets poor researchers get important data cheaply.