LLM Sensitivity Evaluation Framework for Clinical Diagnosis
By: Chenwei Yan, Xiangling Fu, Yuxuan Xiong, and more
Potential Business Impact:
Helps doctors diagnose illnesses more accurately.
Large language models (LLMs) have demonstrated impressive performance across various domains. Clinical diagnosis, however, places higher demands on their reliability and sensitivity: an LLM must think like a physician and stay sensitive to the key medical information that drives diagnostic reasoning, since subtle variations in a case can lead to different diagnoses. Yet existing work focuses mainly on LLMs' sensitivity to irrelevant context and overlooks the importance of key information. In this paper, we investigate the sensitivity of several LLMs (GPT-3.5, GPT-4, Gemini, Claude3, and LLaMA2-7b) to key medical information by introducing different perturbation strategies. The evaluation results highlight the limitations of current LLMs in remaining sensitive to key medical information during diagnostic decision-making. Future LLMs must improve their reliability, their sensitivity to key information, and their ability to use that information effectively. Such improvements would strengthen human trust in LLMs and facilitate their application in real-world clinical settings. Our code and dataset are available at https://github.com/chenwei23333/DiagnosisQA.
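To make the evaluation idea concrete, here is a minimal sketch of a perturbation-based sensitivity check for diagnostic QA. It is an illustrative assumption, not the authors' released code (see the GitHub repository above): the perturbation functions, the ask_model stub, and the case format (a "question" vignette plus a "key_finding" string) are all hypothetical.

from typing import Callable

def mask_key_finding(question: str, key_finding: str) -> str:
    """Perturbation 1: remove a key medical finding from the vignette."""
    return question.replace(key_finding, "")

def negate_key_finding(question: str, key_finding: str) -> str:
    """Perturbation 2: flip a key finding (e.g., 'fever' -> 'no fever')."""
    return question.replace(key_finding, f"no {key_finding}")

def ask_model(prompt: str) -> str:
    """Placeholder for an LLM call (plug in a GPT/Gemini/Claude client here)."""
    raise NotImplementedError("connect your model client")

def sensitivity_rate(cases: list[dict], perturb: Callable[[str, str], str]) -> float:
    """Fraction of cases whose predicted diagnosis changes after perturbation.

    A sensitive model should change its answer when key information is
    masked or negated; an insensitive model returns the same diagnosis.
    """
    changed = 0
    for case in cases:
        original = ask_model(case["question"])
        perturbed = ask_model(perturb(case["question"], case["key_finding"]))
        changed += original.strip() != perturbed.strip()
    return changed / len(cases)

Under this framing, a low sensitivity rate for key-information perturbations (alongside robustness to irrelevant-context edits) would indicate exactly the failure mode the paper studies.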
Similar Papers
The Reliability of LLMs for Medical Diagnosis: An Examination of Consistency, Manipulation, and Contextual Awareness
Computation and Language
Helps doctors diagnose illnesses more accurately.
Evaluating Large Language Models for Evidence-Based Clinical Question Answering
Computation and Language
Helps doctors answer patient questions better.
Performance of Large Language Models in Supporting Medical Diagnosis and Treatment
Computation and Language
AI helps doctors diagnose illnesses and plan treatments.