Building a Silver-Standard Dataset from NICE Guidelines for Clinical LLMs
By: Qing Ding, Eric Hua Qing Zhang, Felix Jozsa, and more
Potential Business Impact:
Helps doctors use AI to follow health rules.
Large language models (LLMs) are increasingly used in healthcare, yet standardised benchmarks for evaluating guideline-based clinical reasoning are lacking. This study introduces a validated dataset derived from publicly available NICE guidelines spanning multiple diagnoses. The dataset was created with GPT assistance and contains realistic patient scenarios paired with clinical questions. We benchmark a range of recent, widely used LLMs to demonstrate the dataset's validity. The framework supports systematic evaluation of LLMs' clinical utility and guideline adherence.
Similar Papers
MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models
Computation and Language
Helps doctors follow medical rules for patients.
Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates
Computation and Language
Helps doctors write patient notes faster.
Performance of Large Language Models in Supporting Medical Diagnosis and Treatment
Computation and Language
AI helps doctors diagnose illnesses and plan treatments.