Using LLMs to create analytical datasets: A case study of reconstructing the historical memory of Colombia

Published: September 3, 2025 | arXiv ID: 2509.04523v1

By: David Anderson, Galia Benitez, Margret Bjarnadottir and more

Potential Business Impact:

Helps understand Colombia's past violence from news.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Colombia has been submerged in decades of armed conflict, yet until recently, the systematic documentation of violence was not a priority for the Colombian government. This has resulted in a lack of publicly available conflict information and, consequently, a lack of historical accounts. This study contributes to Colombia's historical memory by utilizing GPT, a large language model (LLM), to read and answer questions about over 200,000 violence-related newspaper articles in Spanish. We use the resulting dataset to conduct both descriptive analysis and a study of the relationship between violence and the eradication of coca crops, offering an example of policy analyses that such data can support. Our study demonstrates how LLMs have opened new research opportunities by enabling examinations of large text corpora at a previously infeasible depth.

Ground Truth Generation for Multilingual Historical NLP using LLMs

Computation and Language

Helps computers understand old books and writings.

18 Nov 2025

89%

Leveraging Large Language Models to Democratize Access to Costly Datasets for Academic Research

General Finance

Lets poor researchers get important data cheaply.

3 Dec 2024

89%

Leveraging Large Language Models to Democratize Access to Costly Datasets for Academic Research

General Finance

Lets researchers get needed data for free.

3 Dec 2024

Country of Origin

🇺🇸 United States

Page Count

23 pages
