When Evidence Contradicts: Toward Safer Retrieval-Augmented Generation in Healthcare
By: Saeedeh Javadi, Sara Mirabi, Manan Gangar and more
Potential Business Impact:
Makes AI give better medicine answers.
In high-stakes information domains such as healthcare, where large language models (LLMs) can produce hallucinations or misinformation, retrieval-augmented generation (RAG) has been proposed as a mitigation strategy that grounds model outputs in external, domain-specific documents. Yet this approach can introduce errors when source documents contain outdated or contradictory information. This work investigates the performance of five LLMs in generating RAG-based responses to medicine-related queries. Our contributions are threefold: (i) the creation of a benchmark dataset using consumer medicine information documents from the Australian Therapeutic Goods Administration (TGA), where headings are repurposed as natural language questions; (ii) the retrieval of PubMed abstracts using TGA headings, stratified across multiple publication years, to enable controlled temporal evaluation of outdated evidence; and (iii) a comparative analysis of the frequency and impact of outdated or contradictory content on model-generated responses, assessing how LLMs integrate and reconcile temporally inconsistent information. Our findings show that contradictions between highly similar abstracts degrade performance, leading to inconsistencies and reduced factual accuracy in model answers. These results highlight that retrieval similarity alone is insufficient for reliable medical RAG and underscore the need for contradiction-aware filtering strategies to ensure trustworthy responses in high-stakes domains.
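The abstract calls for contradiction-aware filtering but does not describe a specific method, so the following is only an illustrative sketch and not the authors' pipeline. It assumes a hypothetical `Abstract` record, uses token-level Jaccard overlap as a stand-in for retrieval similarity, and takes the contradiction check as a caller-supplied function. The `naive_negation` predicate is a toy placeholder (a real system would more plausibly use a natural language inference model), and dropping the older abstract of a flagged pair is an assumed policy, since a newer publication is not guaranteed to be correct.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List


@dataclass(frozen=True)
class Abstract:
    """Hypothetical record for a retrieved abstract (field names are assumptions)."""
    pmid: str
    year: int
    text: str


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a crude stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def naive_negation(a: str, b: str) -> bool:
    """Toy contradiction check: exactly one of the two texts contains 'not'."""
    return ("not" in a.lower().split()) != ("not" in b.lower().split())


def filter_contradictions(
    abstracts: List[Abstract],
    contradicts: Callable[[str, str], bool],
    sim_threshold: float = 0.5,
) -> List[Abstract]:
    """Drop the older abstract of each highly similar, contradicting pair.

    Assumed policy: prefer the more recent publication when a pair conflicts.
    """
    dropped = set()
    for a, b in combinations(abstracts, 2):
        if a.pmid in dropped or b.pmid in dropped:
            continue
        if jaccard(a.text, b.text) >= sim_threshold and contradicts(a.text, b.text):
            older = a if a.year <= b.year else b
            dropped.add(older.pmid)
    return [x for x in abstracts if x.pmid not in dropped]


# Made-up example texts, not drawn from the paper's dataset.
docs = [
    Abstract("a", 2005, "drug x is safe in pregnancy"),
    Abstract("b", 2020, "drug x is not safe in pregnancy"),
    Abstract("c", 2018, "drug y lowers blood pressure"),
]
kept = filter_contradictions(docs, naive_negation)
print([d.pmid for d in kept])
```

In this toy run the two near-duplicate statements about drug x conflict, so the 2005 abstract is removed before the remaining documents would be passed to the generator.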
Similar Papers
Retrieval Augmented Generation Evaluation for Health Documents
Information Retrieval
Helps doctors find important health info faster.
Evaluating the Robustness of Retrieval-Augmented Generation to Adversarial Evidence in the Health Domain
Information Retrieval
Keeps AI from spreading fake health news.
Retrieval-Augmented Generation with Conflicting Evidence
Computation and Language
AI agents debate to find true answers.