Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses
By: Sahel Sharifymoghaddam, Shivani Upadhyay, Nandan Thakur, and more
Potential Business Impact:
Checks whether AI-generated answers contain all the important facts.
Battles, or side-by-side comparisons in so-called arenas that elicit human preferences, have emerged as a popular approach for assessing the output quality of LLMs. Recently, this idea has been extended to retrieval-augmented generation (RAG) systems. While undoubtedly an advance in evaluation, battles have at least two drawbacks, particularly for complex information-seeking queries: they are neither explanatory nor diagnostic. More recently, the nugget evaluation methodology has emerged as a promising approach for assessing the quality of RAG answers. Nuggets decompose long-form LLM-generated answers into atomic facts, highlighting the important pieces of information that a "good" response should contain. In this work, we apply our AutoNuggetizer framework to analyze data from roughly 7K Search Arena battles provided by LMArena in a fully automatic manner. Our results show a significant correlation between nugget scores and human preferences, demonstrating the promise of our approach for explainable and diagnostic system evaluations. All code necessary to reproduce the results in our work is available at https://github.com/castorini/lmsys_nuggetize.
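To make the nugget idea concrete, below is a minimal sketch of nugget-based answer scoring: an answer is scored by the fraction of (vital) nuggets it supports, and the per-answer scores can then be compared against the human preference recorded for a battle. The Nugget data class, the label names ("support", "partial_support", "not_support"), and the 1.0/0.5/0.0 weights are illustrative assumptions for this sketch, not the paper's actual AutoNuggetizer implementation or the code in the linked repository.

```python
# Illustrative sketch of nugget-based scoring (assumed data layout and weights,
# not the AutoNuggetizer code itself).

from dataclasses import dataclass
from typing import List


@dataclass
class Nugget:
    text: str      # atomic fact relevant to the query
    vital: bool    # whether the fact is essential to a "good" answer
    support: str   # judged assignment for one answer:
                   # "support", "partial_support", or "not_support"


def nugget_score(nuggets: List[Nugget], vital_only: bool = True) -> float:
    """Score one answer as the weighted fraction of nuggets it supports.

    Fully supported nuggets count 1.0, partially supported 0.5, and
    unsupported 0.0; restricting to vital nuggets mirrors a common
    "vital-only" scoring variant. Weights are an assumption of this sketch.
    """
    pool = [n for n in nuggets if n.vital] if vital_only else nuggets
    if not pool:
        return 0.0
    weights = {"support": 1.0, "partial_support": 0.5, "not_support": 0.0}
    return sum(weights[n.support] for n in pool) / len(pool)


if __name__ == "__main__":
    # Toy battle: two answers to the same query, judged against three nuggets.
    answer_a = [
        Nugget("X was founded in 1998", True, "support"),
        Nugget("X is headquartered in Toronto", True, "partial_support"),
        Nugget("X employs about 500 people", False, "not_support"),
    ]
    answer_b = [
        Nugget("X was founded in 1998", True, "not_support"),
        Nugget("X is headquartered in Toronto", True, "support"),
        Nugget("X employs about 500 people", False, "support"),
    ]
    print(f"A: {nugget_score(answer_a):.2f}  B: {nugget_score(answer_b):.2f}")
    # The side with the higher nugget score can then be checked against the
    # human preference (win/loss/tie) logged for the same arena battle.
```

Aggregating such per-battle comparisons over many battles is what allows a correlation between nugget scores and human preferences to be measured, while the individual nugget assignments provide the explanatory and diagnostic signal that raw win rates lack.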
Similar Papers
The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models
Information Retrieval
Tests AI answers automatically, saving time.
Conversational Gold: Evaluating Personalized Conversational Search System using Gold Nuggets
Information Retrieval
Helps AI give better answers to your questions.
A Knowledge Graph and a Tripartite Evaluation Framework Make Retrieval-Augmented Generation Scalable and Transparent
Information Retrieval
Chatbots answer questions more accurately and reliably.