ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning
By: Yuqi Tang, Jing Yu, Zichang Su, and more
Potential Business Impact:
Tests how well AI doctors can figure out illnesses.
Clinical diagnosis begins with doctor-patient interaction, during which physicians iteratively gather information, order examinations, and refine the differential diagnosis based on patients' responses. This dynamic clinical-reasoning process is poorly represented by existing LLM benchmarks, which focus on static question answering. To address these gaps, recent methods explore dynamic medical frameworks built around interactive clinical dialogues. Although effective, they often rely on limited, contamination-prone datasets and lack granular, multi-level evaluation. In this work, we propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Grounded in a disease knowledge graph, our method dynamically generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent. Our evaluation protocol goes beyond diagnostic accuracy by incorporating fine-grained efficiency analysis and rubric-based assessment of diagnostic quality. Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs, offering a more nuanced and clinically meaningful evaluation paradigm.
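The abstract only sketches the framework at a high level. Below is a minimal illustrative sketch of how such a knowledge-graph-grounded, multi-turn diagnostic dialogue loop might be wired up; all names (PatientCase, doctor_llm, patient_agent, the scoring fields) are assumptions for illustration, not ClinDEF's actual API.

```python
# Hypothetical sketch of a ClinDEF-style diagnostic dialogue loop.
# Names and structures are illustrative assumptions, not the paper's code.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class PatientCase:
    """A patient case sampled from a disease knowledge graph (assumed structure)."""
    disease: str            # ground-truth diagnosis
    findings: Dict[str, str]  # symptom/exam -> result, drawn from the graph
    chief_complaint: str


@dataclass
class DialogueRecord:
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (doctor, patient)
    final_diagnosis: str = ""


def run_consultation(case: PatientCase,
                     doctor_llm: Callable[[str], str],
                     patient_agent: Callable[[PatientCase, str], str],
                     max_turns: int = 10) -> DialogueRecord:
    """Multi-turn loop: the doctor model asks questions or orders exams,
    the patient agent answers from the case, until a diagnosis is committed."""
    record = DialogueRecord()
    context = f"Chief complaint: {case.chief_complaint}\n"
    for _ in range(max_turns):
        doctor_utterance = doctor_llm(context)
        if doctor_utterance.lower().startswith("diagnosis:"):
            record.final_diagnosis = doctor_utterance.split(":", 1)[1].strip()
            break
        patient_reply = patient_agent(case, doctor_utterance)
        record.turns.append((doctor_utterance, patient_reply))
        context += f"Doctor: {doctor_utterance}\nPatient: {patient_reply}\n"
    return record


def evaluate(record: DialogueRecord, case: PatientCase) -> dict:
    """Multi-level scoring: diagnostic accuracy plus a simple efficiency proxy.
    A rubric-based quality score (e.g., from a judge model) would be added here."""
    return {
        "accuracy": float(record.final_diagnosis.lower() == case.disease.lower()),
        "num_turns": len(record.turns),
    }
```

The design choice the sketch tries to reflect is that evaluation is attached to the whole trajectory, not just the final answer: the number of turns and the sequence of questions feed the efficiency and rubric-based assessments described in the abstract.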
Similar Papers
Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning
Computation and Language
Helps doctors find sickness faster by asking questions.
MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs
Computation and Language
Makes AI safer for doctors to use.
Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation
Computation and Language
Tests AI doctors on real patient problems.