Just Put a Human in the Loop? Investigating LLM-Assisted Annotation for Subjective Tasks

Published: July 21, 2025 | arXiv ID: 2507.15821v1

By: Hope Schroeder, Deb Roy, Jad Kabbara

BigTech Affiliations: Massachusetts Institute of Technology

Potential Business Impact:

AI-generated suggestions bias how people label data, which can inflate reported model performance.

Plain English Summary

Imagine you're trying to teach a computer to understand human language, like figuring out if a movie review is positive or negative. This new research shows that when people get help from AI to label these reviews, they tend to agree with the AI a lot, even if there are many possible "right" answers. This can make it look like the AI is much better than it actually is, and it might mess up how we use that information later on. So, it's important to be careful when using AI to help label things, especially when the answers aren't black and white.

LLM use in annotation is becoming widespread, and given LLMs' overall promising performance and speed, simply "reviewing" LLM annotations in interpretive tasks can be tempting. In subjective annotation tasks with multiple plausible answers, reviewing LLM outputs can change the label distribution, impacting both the evaluation of LLM performance and downstream social science analysis that uses these labels. We conducted a pre-registered experiment with 410 unique annotators and over 7,000 annotations, testing three AI assistance conditions against controls across two models and two datasets. We find that presenting crowdworkers with LLM-generated annotation suggestions did not make them faster, but did improve their self-reported confidence in the task. More importantly, annotators readily adopted the LLM suggestions, significantly changing the label distribution compared to the baseline. When labels created with LLM assistance are then used to evaluate LLM performance, reported model performance significantly increases. We believe our work underlines the importance of understanding the impact of LLM-assisted annotation on subjective, qualitative tasks, on the creation of gold data for training and testing, and on the evaluation of NLP systems on subjective tasks.
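The inflation effect in that last result is easy to see with a toy simulation. The sketch below is illustrative only (the accuracy and adoption rates are assumptions, not the paper's figures): it models an annotator who accepts the LLM's suggestion on some fraction of items, then compares the model's measured accuracy against independent labels versus those LLM-assisted labels.

```python
import random

random.seed(0)

N = 10_000            # items to annotate
MODEL_ACC = 0.70      # assumed: model accuracy vs. independent human labels
ADOPTION_RATE = 0.50  # assumed: fraction of items where the annotator
                      # simply accepts the LLM suggestion

LABELS = ("positive", "negative")

def flip(label: str) -> str:
    """Return the other binary label."""
    return "negative" if label == "positive" else "positive"

agree_independent = 0
agree_assisted = 0

for _ in range(N):
    truth = random.choice(LABELS)
    # The model suggests the correct label with probability MODEL_ACC.
    model = truth if random.random() < MODEL_ACC else flip(truth)
    # Idealized independent annotator: labels without seeing the suggestion.
    independent = truth
    # Assisted annotator: adopts the model's suggestion some of the time,
    # otherwise labels independently.
    assisted = model if random.random() < ADOPTION_RATE else truth

    agree_independent += model == independent
    agree_assisted += model == assisted

print(f"accuracy vs. independent labels:  {agree_independent / N:.3f}")
print(f"accuracy vs. LLM-assisted labels: {agree_assisted / N:.3f}")
```

Under these assumptions the model scores about 0.70 against independent labels but about 0.85 against assisted ones (adoption guarantees agreement on half the items), even though the model itself has not changed. This is the circularity the paper warns about when LLM-assisted labels are reused as gold data for evaluation.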

Country of Origin
πŸ‡ΊπŸ‡Έ United States

Page Count
25 pages

Category
Computer Science: Computers and Society