Revisiting Human-vs-LLM judgments using the TREC Podcast Track
By: Watheq Mansour, J. Shane Culpepper, Joel Mackenzie, and more
Potential Business Impact:
Shows how well AI can judge whether podcast search results actually answer a person's question.
Using large language models (LLMs) to annotate relevance is an increasingly important technique in the information retrieval community. While some studies demonstrate that LLMs can achieve high agreement with ground-truth (human) judgments, others have argued for the opposite conclusion. To the best of our knowledge, these studies have primarily focused on classic ad-hoc text search scenarios. In this paper, we analyze agreement between LLMs and human experts, and explore the impact that disagreement has on system rankings. In contrast to prior studies, we focus on a collection of audio files transcribed into two-minute segments: the TREC 2020 and 2021 Podcast Tracks. We employ five different LLMs to re-assess all of the query-segment pairs originally annotated by TREC assessors. Furthermore, we re-assess a small subset of pairs where the LLMs and TREC assessors disagree most strongly, and find that the human experts tend to agree with the LLMs more than with the TREC assessors. Our results reinforce Sormunen's 2002 insight that relying on a single assessor leads to lower inter-assessor agreement.
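The kind of analysis the abstract describes, measuring label agreement between LLMs and human assessors and then checking how much the resulting system rankings change, can be sketched with standard tools. The snippet below is a minimal illustration, not the authors' code: the labels, run names, and scores are hypothetical placeholders, and it assumes graded relevance labels plus one evaluation score per run under each judgment set.

```python
# Minimal sketch (not the paper's implementation): measure LLM-vs-human label
# agreement with Cohen's kappa, then compare the system rankings each set of
# judgments induces using Kendall's tau. All data below is illustrative.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kendalltau

# Hypothetical graded relevance labels (0-3) for the same query-segment pairs.
human_labels = [2, 0, 3, 1, 0, 2, 1, 3]
llm_labels   = [2, 1, 3, 1, 0, 2, 0, 3]

# Chance-corrected agreement between the two sets of assessments.
kappa = cohen_kappa_score(human_labels, llm_labels)

# Hypothetical per-run evaluation scores (e.g., NDCG), computed once against
# the human judgments and once against the LLM judgments.
scores_human = {"runA": 0.61, "runB": 0.55, "runC": 0.48, "runD": 0.40}
scores_llm   = {"runA": 0.58, "runB": 0.57, "runC": 0.45, "runD": 0.41}

runs = sorted(scores_human)  # fixed run order for a paired comparison
tau, p_value = kendalltau([scores_human[r] for r in runs],
                          [scores_llm[r] for r in runs])

print(f"Cohen's kappa (label agreement): {kappa:.3f}")
print(f"Kendall's tau (ranking agreement): {tau:.3f} (p={p_value:.3f})")
```

High kappa means the LLM labels track the human labels closely; high tau means that, even where individual labels differ, the two judgment sets rank systems in nearly the same order, which is the property that matters most for evaluation.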
Similar Papers
Query-Document Dense Vectors for LLM Relevance Judgment Bias Analysis
Information Retrieval
Finds where AI makes mistakes judging information.
Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge
Information Retrieval
AI judges podcast picks like a person.
Judging the Judges: A Collection of LLM-Generated Relevance Judgements
Information Retrieval
Computers can now judge search results faster than people.