Towards a rigorous evaluation of RAG systems: the challenge of due diligence
By: Grégoire Martinon, Alexandra Lorenzo de Brionne, Jérôme Bohard, and more
Potential Business Impact:
Makes AI give more trustworthy answers for important jobs.
The rise of generative AI has driven significant advancements in high-risk sectors like healthcare and finance. The Retrieval-Augmented Generation (RAG) architecture, which combines large language models (LLMs) with search engines, is particularly notable for its ability to generate responses grounded in document corpora. Despite its potential, the reliability of RAG systems in critical contexts remains a concern, with issues such as hallucinations persisting. This study evaluates a RAG system used for due diligence in an investment fund. We propose a robust evaluation protocol that combines human annotations with LLM-judge annotations to identify system failures such as hallucinations, off-topic responses, failed citations, and abstentions. Inspired by the Prediction-Powered Inference (PPI) method, we obtain precise performance measurements with statistical guarantees. We also provide a comprehensive dataset for further analysis. Our contributions aim to enhance the reliability and scalability of RAG system evaluation protocols in industrial applications.
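The core idea behind PPI-style evaluation is simple: use cheap LLM-judge labels on a large set of responses, then correct the judge's bias with a small human-annotated subset. The sketch below illustrates this for estimating a single failure rate; all data, names, and parameters are illustrative assumptions, not the paper's actual protocol or dataset.

```python
import random
from statistics import mean, variance

random.seed(0)

# Hypothetical data (illustrative only):
# - a small set with both human gold labels and LLM-judge labels,
# - a large set with LLM-judge labels only.
human = [1.0 if random.random() < 0.80 else 0.0 for _ in range(200)]
judge_labeled = [y if random.random() < 0.9 else 1.0 - y for y in human]  # noisy judge
judge_unlabeled = [1.0 if random.random() < 0.78 else 0.0 for _ in range(5000)]

def ppi_mean(y, f_lab, f_unlab, z=1.96):
    """Prediction-powered estimate of a proportion with a ~95% CI.

    theta = mean(judge on unlabeled) + mean(human - judge on labeled):
    the cheap judge-based estimate plus a "rectifier" term that corrects
    the judge's bias using the small human-annotated sample.
    """
    n, N = len(y), len(f_unlab)
    rect = [yi - fi for yi, fi in zip(y, f_lab)]          # per-item judge error
    theta = mean(f_unlab) + mean(rect)
    # Variance combines uncertainty from both the large and small samples.
    var = variance(f_unlab) / N + variance(rect) / n
    half = z * var ** 0.5
    return theta, (theta - half, theta + half)

theta, (lo, hi) = ppi_mean(human, judge_labeled, judge_unlabeled)
```

The resulting confidence interval is typically much tighter than one built from the human sample alone, because the judge labels carry most of the statistical weight while the rectifier keeps the estimate unbiased.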
Similar Papers
A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions
Computation and Language
Makes AI answers more truthful and up-to-date.
Grounding Large Language Models in Clinical Evidence: A Retrieval-Augmented Generation System for Querying UK NICE Clinical Guidelines
Computation and Language
Helps doctors find medical advice fast.
Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey
Computation and Language
Tests how AI uses outside facts to answer questions.