A large-scale, unsupervised pipeline for automatic corpus annotation using LLMs: variation and change in the English consider construction

Published: October 14, 2025 | arXiv ID: 2510.12306v1

By: Cameron Morin, Matti Marttinen Larsson

Potential Business Impact:

Lets computers label grammatical constructions in large text collections for language study.

Business Areas:

Natural Language Processing Artificial Intelligence, Data and Analytics, Software

As natural language corpora expand at an unprecedented rate, manual annotation remains a significant methodological bottleneck in corpus linguistic work. We address this challenge by presenting a scalable, unsupervised pipeline for automating grammatical annotation in voluminous corpora using large language models (LLMs). Unlike previous supervised and iterative approaches, our method employs a four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing, and post-hoc validation. We demonstrate the pipeline's accessibility and effectiveness through a diachronic case study of variation in the English consider construction. Using GPT-5 through the OpenAI API, we annotate 143,933 sentences from the Corpus of Historical American English (COHA) in under 60 hours, achieving 98%+ accuracy on two sophisticated annotation procedures. Our results suggest that LLMs can perform a range of data preparation tasks at scale with minimal human intervention, opening new possibilities for corpus-based research, though implementation requires attention to costs, licensing, and other ethical considerations.
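The four-phase workflow named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the label set, the prompt wording, the function names, and the `toy_annotator` keyword rule, which stands in for a real LLM call so the example runs offline.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical label set; the source does not list the annotation categories.
LABELS = ("zero", "as", "to_be", "other")

Annotator = Callable[[str], str]  # takes a prompt, returns a label


def build_prompt(sentence: str) -> str:
    """Phase 1 (prompt engineering): wrap a sentence in a fixed instruction."""
    return (
        "Classify the complement of 'consider' in the sentence below as one of "
        f"{', '.join(LABELS)}. Answer with the label only.\n\nSentence: {sentence}"
    )


def accuracy(annotate: Annotator, gold: List[Tuple[str, str]]) -> float:
    """Phases 2 and 4 (pre-hoc evaluation, post-hoc validation):
    share of hand-labelled items on which the annotator agrees."""
    hits = sum(annotate(build_prompt(s)) == label for s, label in gold)
    return hits / len(gold)


def annotate_in_batches(
    annotate: Annotator, sentences: List[str], batch_size: int = 100
) -> List[str]:
    """Phase 3 (automated batch processing): label the corpus chunk by chunk."""
    labels: List[str] = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        labels.extend(annotate(build_prompt(s)) for s in batch)
    return labels


def validation_sample(
    sentences: List[str], labels: List[str], k: int, seed: int = 0
) -> List[Tuple[str, str]]:
    """Phase 4 helper: draw k random (sentence, label) pairs for manual checking."""
    rng = random.Random(seed)
    picked = rng.sample(range(len(sentences)), k)
    return [(sentences[i], labels[i]) for i in picked]


def toy_annotator(prompt: str) -> str:
    """Stand-in for an LLM call: a keyword rule, not a real model."""
    sentence = prompt.split("Sentence: ", 1)[1]
    if " to be " in sentence:
        return "to_be"
    if " as " in sentence:
        return "as"
    return "zero"
```

In a real run, `toy_annotator` would be replaced by a function that sends the prompt to a model and returns its answer; the surrounding phases stay the same. For scale, 143,933 sentences in under 60 hours corresponds to a sustained rate of at least roughly 2,400 sentences per hour.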

Towards Corpus-Grounded Agentic LLMs for Multilingual Grammatical Analysis

Computation and Language

AI helps understand language rules in many languages.

28 Nov 2025 0

90%

A Fully Automated Pipeline for Conversational Discourse Annotation: Tree Scheme Generation and Labeling with Large Language Models

Computation and Language

Computers learn to understand conversations automatically.

11 Apr 2025 1

90%

Large Language Models for Oral History Understanding with Text Classification and Sentiment Analysis

Computation and Language

Helps computers understand stories of past injustices.

8 Aug 2025 1

Page Count

26 pages
