Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation
By: Krisztian Balog, Donald Metzler, Zhen Qin
Potential Business Impact:
AI judges unfairly favor AI-ranked results.
Large language models (LLMs) are increasingly integral to information retrieval (IR), powering ranking, evaluation, and AI-assisted content creation. This widespread adoption necessitates a critical examination of potential biases arising from the interplay between these LLM-based components. This paper synthesizes existing research and presents novel experiment designs that explore how LLM-based rankers and assistants influence LLM-based judges. We provide the first empirical evidence of LLM judges exhibiting significant bias towards LLM-based rankers. Furthermore, we observe limitations in LLM judges' ability to discern subtle system performance differences. Contrary to some previous findings, our preliminary study does not find evidence of bias against AI-generated content. These results highlight the need for a more holistic view of the LLM-driven information ecosystem. To this end, we offer initial guidelines and a research agenda to ensure the reliable use of LLMs in IR evaluation.
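To make the studied setup concrete, below is a minimal sketch of the LLM-as-judge evaluation loop the abstract refers to: an LLM assigns graded relevance labels to the results returned by a ranker, and those labels are aggregated into a system score. The `call_llm` hook, the prompt wording, the 0-3 grade scale, and the DCG aggregation are illustrative assumptions, not the paper's actual experimental protocol.

```python
# Minimal sketch of LLM-as-judge IR evaluation.
# Assumptions: `call_llm` is a placeholder for any chat-completion API;
# prompt text, grade scale, and aggregation metric are illustrative only.
import math
from typing import Callable, Dict, List

JUDGE_PROMPT = (
    "You are a relevance assessor. Given a query and a document, reply with a "
    "single integer grade from 0 (irrelevant) to 3 (perfectly relevant).\n"
    "Query: {query}\nDocument: {document}\nGrade:"
)


def judge_relevance(call_llm: Callable[[str], str], query: str, document: str) -> int:
    """Ask the LLM judge for a graded relevance label for one query-document pair."""
    reply = call_llm(JUDGE_PROMPT.format(query=query, document=document))
    try:
        return max(0, min(3, int(reply.strip()[0])))  # clamp to the 0-3 scale
    except (ValueError, IndexError):
        return 0  # treat unparsable judge replies as irrelevant


def dcg_at_k(grades: List[int], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k judged results."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades[:k]))


def evaluate_run(call_llm: Callable[[str], str],
                 run: Dict[str, List[str]],
                 docs: Dict[str, str]) -> float:
    """Score one ranker's run (query -> ranked doc ids) using the LLM judge."""
    per_query = [
        dcg_at_k([judge_relevance(call_llm, query, docs[doc_id]) for doc_id in doc_ids])
        for query, doc_ids in run.items()
    ]
    return sum(per_query) / len(per_query)
```

Running `evaluate_run` with the same judge over two runs (for example, an LLM-based ranker versus a lexical baseline) and comparing the resulting scores against human-labelled judgments is the kind of comparison in which the paper reports judge bias toward the LLM-based ranker.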
Similar Papers
Judging the Judges: A Collection of LLM-Generated Relevance Judgements
Information Retrieval
Computers can now judge search results faster than people.
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
Information Retrieval
AI judges might trick us into thinking systems are good.
Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems
Artificial Intelligence
Makes AI judges fairer and more trustworthy.