Evaluating LLMs in Medicine: A Call for Rigor, Transparency
By: Mahmoud Alwakeel, Aditya Nagori, Vijay Krishnamoorthy, and more
Potential Business Impact:
Pushes for more rigorous testing of AI that answers medical questions.
Objectives: To evaluate the current limitations of large language models (LLMs) in medical question answering, focusing on the quality of the datasets used for their evaluation.
Materials and Methods: Widely used benchmark datasets, including MedQA, MedMCQA, PubMedQA, and MMLU, were reviewed for their rigor, transparency, and relevance to clinical scenarios. Alternatives, such as challenge questions in medical journals, were also analyzed as potentially unbiased evaluation tools.
Results: Most existing datasets lack clinical realism, transparency, and robust validation processes. Publicly available challenge questions offer some benefits but are limited by their small size, narrow scope, and exposure in LLM training data. These gaps highlight the need for secure, comprehensive, and representative datasets.
Conclusion: A standardized framework is critical for evaluating LLMs in medicine. Collaborative efforts among institutions and policymakers are needed to ensure that datasets and methodologies are rigorous, unbiased, and reflective of clinical complexity.
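For context, the kind of benchmark evaluation the paper critiques typically reduces to a multiple-choice accuracy loop. Below is a minimal sketch, assuming the Hugging Face `datasets` library, the community MedQA mirror `GBaker/MedQA-USMLE-4-options`, and a hypothetical `model_answer` helper standing in for an LLM call; none of these identifiers come from the paper itself.

```python
# Minimal sketch of a multiple-choice benchmark evaluation loop.
# Dataset ID, field names, and model_answer() are illustrative
# assumptions, not specified by the paper.
from datasets import load_dataset

def model_answer(question: str, options: dict) -> str:
    """Hypothetical stand-in for an LLM call; should return an
    option key such as "A", "B", "C", or "D"."""
    raise NotImplementedError("plug in an actual LLM client here")

# Load the USMLE-style MedQA test split (assumed hub identifier).
medqa = load_dataset("GBaker/MedQA-USMLE-4-options", split="test")

correct = 0
for item in medqa:
    # Each record pairs a clinical vignette with lettered options
    # and an answer key; exact field names vary across mirrors.
    if model_answer(item["question"], item["options"]) == item["answer_idx"]:
        correct += 1

print(f"Accuracy: {correct / len(medqa):.3f}")
```

Scores from loops like this are exactly what the paper argues can mislead: when the questions lack clinical realism or have leaked into an LLM's training data, high accuracy overstates real clinical capability.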
Similar Papers
Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs
Computation and Language
Pushes LLM testing beyond exam questions toward real clinical decisions.
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
Computation and Language
Benchmarks medical AI against physician-validated answers.
A Comprehensive Survey on the Trustworthiness of Large Language Models in Healthcare
Computers and Society
Surveys what makes medical AI trustworthy and safe for patients.