RxSafeBench: Identifying Medication Safety Issues of Large Language Models in Simulated Consultation
By: Jiahao Zhao, Luxin Xu, Minghuan Tan, and more
Potential Business Impact:
Tests whether AI can recommend medicines safely in simulated doctor-patient consultations.
Numerous medical systems powered by Large Language Models (LLMs) have achieved remarkable progress on diverse healthcare tasks. However, research on their medication safety remains limited, owing to the lack of real-world datasets, which are constrained by privacy and accessibility issues. Moreover, the evaluation of LLMs in realistic clinical consultation settings, particularly regarding medication safety, remains underexplored. To address these gaps, we propose a framework that simulates and evaluates clinical consultations to systematically assess the medication safety capabilities of LLMs. Within this framework, we generate inquiry-diagnosis dialogues with embedded medication risks and construct a dedicated medication safety database, RxRisk DB, containing 6,725 contraindications, 28,781 drug interactions, and 14,906 indication-drug pairs. A two-stage filtering strategy ensures clinical realism and professional quality, yielding RxSafeBench, a benchmark of 2,443 high-quality consultation scenarios. We evaluate leading open-source and proprietary LLMs using structured multiple-choice questions that test their ability to recommend safe medications in simulated patient contexts. Results show that current LLMs struggle to integrate contraindication and interaction knowledge, especially when risks are implied rather than stated explicitly. Our findings highlight key challenges to ensuring medication safety in LLM-based systems and offer insights into improving reliability through better prompting and task-specific tuning. RxSafeBench provides the first comprehensive benchmark for evaluating medication safety in LLMs, advancing safer and more trustworthy AI-driven clinical decision support.
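To make the evaluation setup concrete, below is a minimal sketch of how a benchmark item of this kind could be scored against contraindication and drug-interaction tables. All class names, fields, and drug examples are illustrative assumptions, not the paper's actual RxRisk DB schema.

```python
# Hypothetical sketch: checking multiple-choice drug options against
# RxRisk DB-style contraindication and interaction tables.
# Every name and record here is an illustrative assumption.
from dataclasses import dataclass


@dataclass
class PatientContext:
    conditions: set[str]       # e.g., {"chronic kidney disease"}
    current_drugs: set[str]    # e.g., {"warfarin"}


@dataclass
class RxRiskDB:
    contraindications: set[tuple[str, str]]  # (drug, condition) pairs
    interactions: set[frozenset[str]]        # unordered drug pairs

    def is_safe(self, drug: str, patient: PatientContext) -> bool:
        # Unsafe if the drug is contraindicated for any of the patient's
        # conditions, or interacts with any drug the patient already takes.
        if any((drug, c) in self.contraindications for c in patient.conditions):
            return False
        if any(frozenset({drug, d}) in self.interactions
               for d in patient.current_drugs):
            return False
        return True


# Toy example: keep only the safe options among multiple-choice candidates.
db = RxRiskDB(
    contraindications={("ibuprofen", "chronic kidney disease")},
    interactions={frozenset({"warfarin", "aspirin"})},
)
patient = PatientContext(
    conditions={"chronic kidney disease"},
    current_drugs={"warfarin"},
)
options = ["ibuprofen", "aspirin", "acetaminophen"]
print([o for o in options if db.is_safe(o, patient)])  # ['acetaminophen']
```

In this toy setup, a model's chosen option would be scored correct only if it passes both the contraindication and the interaction check for the simulated patient, mirroring the implicit-risk scenarios the abstract describes.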
Similar Papers
Human-Level and Beyond: Benchmarking Large Language Models Against Clinical Pharmacists in Prescription Review
Computation and Language
Helps computers find mistakes in medicine orders.
Robust or Suggestible? Exploring Non-Clinical Induction in LLM Drug-Safety Decisions
Computation and Language
Shows AI drug-safety decisions can be swayed by non-clinical hints.
Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment
Artificial Intelligence
Makes AI doctors safer by catching bad advice.