The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models

Published: April 21, 2025 | arXiv ID: 2504.15068v1

By: Ronak Pradeep, Nandan Thakur, Shivani Upadhyay and more

Potential Business Impact:

Tests AI answers automatically, saving time.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large Language Models (LLMs) have significantly enhanced the capabilities of information access systems, especially with retrieval-augmented generation (RAG). Nevertheless, the evaluation of RAG systems remains a barrier to continued progress, a challenge we tackle in this work by proposing an automatic evaluation framework that is validated against human annotations. We believe that the nugget evaluation methodology provides a solid foundation for evaluating RAG systems. This approach, originally developed for the TREC Question Answering (QA) Track in 2003, evaluates systems based on atomic facts that should be present in good answers. Our efforts focus on "refactoring" this methodology, where we describe the AutoNuggetizer framework that specifically applies LLMs to both automatically create nuggets and automatically assign nuggets to system answers. In the context of the TREC 2024 RAG Track, we calibrate a fully automatic approach against strategies where nuggets are created manually or semi-manually by human assessors and then assigned manually to system answers. Based on results from a community-wide evaluation, we observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants. The agreement is stronger when individual framework components such as nugget assignment are automated independently. This suggests that our evaluation framework provides tradeoffs between effort and quality that can be used to guide the development of future RAG systems. However, further research is necessary to refine our approach, particularly in establishing robust per-topic agreement to diagnose system failures effectively.
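
As a rough illustration of the nugget idea described in the abstract, the sketch below scores one answer as the fraction of nuggets it contains and averages per-topic scores into a run-level score. It is not the AutoNuggetizer implementation: the names `Nugget`, `keyword_assigner`, `nugget_recall`, and `run_score` are hypothetical, the unweighted recall formula is an assumption, and the keyword matcher is only a placeholder for the LLM-based (or human) assignment step.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Nugget:
    """One atomic fact that a good answer should contain."""
    text: str


def keyword_assigner(nugget: Nugget, answer: str) -> bool:
    """Hypothetical stand-in for the nugget assignment step.

    Treats a nugget as present when every word of the nugget occurs in
    the answer. A real pipeline would ask an LLM or a human assessor.
    """
    answer_words = set(answer.lower().split())
    return all(word in answer_words for word in nugget.text.lower().split())


def nugget_recall(
    nuggets: Iterable[Nugget],
    answer: str,
    assigner: Callable[[Nugget, str], bool] = keyword_assigner,
) -> float:
    """Fraction of nuggets that the assigner finds in the answer."""
    nugget_list = list(nuggets)
    if not nugget_list:
        raise ValueError("at least one nugget is required")
    matched = sum(1 for nugget in nugget_list if assigner(nugget, answer))
    return matched / len(nugget_list)


def run_score(per_topic_scores: Iterable[float]) -> float:
    """Run-level score as the mean of per-topic scores (an assumption)."""
    scores = list(per_topic_scores)
    if not scores:
        raise ValueError("at least one topic score is required")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    nuggets = [Nugget("trec 2003"), Nugget("atomic facts"), Nugget("kendall tau")]
    answer = "The nugget methodology from TREC 2003 scores atomic facts"
    print(nugget_recall(nuggets, answer))
    print(run_score([0.5, 1.0]))
```

Passing the assigner as a parameter mirrors the comparison in the abstract: the same scoring code can be run with a manual, semi-manual, or automatic assignment function, and the resulting run-level scores can then be compared.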

Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey

Computation and Language

Tests how AI uses outside facts to answer questions.

21 Apr 2025 0

92%

Knowledge-Graph Based RAG System Evaluation Framework

Computation and Language

Tests AI writing better by checking its thinking.

2 Oct 2025 2

91%

Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses

Information Retrieval

Checks if AI answers have all the important facts.

28 Apr 2025 2

Page Count

10 pages
