Reliable Annotations with Less Effort: Evaluating LLM-Human Collaboration in Search Clarifications
By: Leila Tavakoli, Hamed Zamani
Potential Business Impact:
Cuts annotation costs by letting LLMs label most data and routing only uncertain cases to human reviewers.
Despite growing interest in using large language models (LLMs) to automate annotation, their effectiveness in complex, nuanced, and multi-dimensional labelling tasks remains relatively underexplored. This study focuses on annotation for the search clarification task, leveraging a high-quality, multi-dimensional dataset that includes five distinct fine-grained annotation subtasks. Although LLMs have shown impressive capabilities in general settings, our study reveals that even state-of-the-art models struggle to replicate human-level performance in subjective or fine-grained evaluation tasks. Through a systematic assessment, we demonstrate that LLM predictions are often inconsistent, poorly calibrated, and highly sensitive to prompt variations. To address these limitations, we propose a simple yet effective human-in-the-loop (HITL) workflow that uses confidence thresholds and inter-model disagreement to selectively involve human review. Our findings show that this lightweight intervention significantly improves annotation reliability while reducing human effort by up to 45%, offering a scalable, cost-effective, and accurate path forward for deploying LLMs in real-world evaluation settings.
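To make the routing idea concrete, here is a minimal sketch of how confidence thresholds and inter-model disagreement could be combined to decide which items go to human review. The function name, threshold value, and `LabelDecision` structure are illustrative assumptions, not the authors' implementation.

```python
"""Sketch of selective human-review routing based on the abstract's description:
accept an LLM label only when models agree and are confident; otherwise escalate."""
from dataclasses import dataclass

@dataclass
class LabelDecision:
    item_id: str
    label: str | None        # accepted LLM label, or None if routed to a human
    needs_human_review: bool

def route_annotation(item_id: str,
                     model_labels: dict[str, str],
                     model_confidences: dict[str, float],
                     confidence_threshold: float = 0.8) -> LabelDecision:
    """Return the LLM label when all models agree and exceed the confidence
    threshold; otherwise flag the item for human review."""
    labels = set(model_labels.values())
    models_agree = len(labels) == 1
    all_confident = all(c >= confidence_threshold
                        for c in model_confidences.values())

    if models_agree and all_confident:
        return LabelDecision(item_id, labels.pop(), needs_human_review=False)
    return LabelDecision(item_id, None, needs_human_review=True)

# Example: two models disagree on a clarification-quality label,
# so the item is escalated to a human annotator.
decision = route_annotation(
    "q_001",
    model_labels={"model_a": "useful", "model_b": "not_useful"},
    model_confidences={"model_a": 0.92, "model_b": 0.71},
)
print(decision)  # needs_human_review=True
```

Under this kind of policy, the human-effort savings reported in the paper come from the (ideally large) fraction of items where the models agree confidently and no review is triggered.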
Similar Papers
Just Put a Human in the Loop? Investigating LLM-Assisted Annotation for Subjective Tasks
Computers and Society
AI suggestions change how people label things.
LLMs as Data Annotators: How Close Are We to Human Performance
Computation and Language
Finds best examples to teach computers faster.
Can LLMs Evaluate What They Cannot Annotate? Revisiting LLM Reliability in Hate Speech Detection
Computation and Language
Helps computers judge online hate speech better.