Data Quality Challenges in Retrieval-Augmented Generation
By: Leopold Müller, Joshua Holstein, Sarah Bause and more
Potential Business Impact:
Improves AI's answers by checking its information.
Organizations increasingly adopt Retrieval-Augmented Generation (RAG) to enhance Large Language Models with enterprise-specific knowledge. However, current data quality (DQ) frameworks were developed primarily for static datasets and do not adequately address the dynamic, multi-stage nature of RAG systems. This study aims to develop DQ dimensions for this new type of AI-based system. We conduct 16 semi-structured interviews with practitioners at leading IT service companies. Through a qualitative content analysis, we inductively derive 15 distinct DQ dimensions across the four processing stages of RAG systems: data extraction, data transformation, prompt & search, and generation. Our findings reveal that (1) new dimensions have to be added to traditional DQ frameworks to cover RAG contexts; (2) these new dimensions are concentrated in early RAG steps, suggesting the need for front-loaded quality management strategies; and (3) DQ issues transform and propagate through the RAG pipeline, necessitating a dynamic, step-aware approach to quality management.
Similar Papers
Enhancing Retrieval-Augmented Generation: A Study of Best Practices
Computation and Language
Makes AI smarter by giving it better information.
Retrieval-Augmented Generation in Industry: An Interview Study on Use Cases, Requirements, Challenges, and Evaluation
Information Retrieval
Helps AI answer questions using real-world facts.
Domain-Specific Data Generation Framework for RAG Adaptation
Computation and Language
Helps AI learn from specific books and documents.