MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs
By: Zhan Qu, Michael Färber
Potential Business Impact:
Makes AI language models safer for doctors to use in clinical settings.
Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions. CoRFu improves macro-F1 by 16.4 points over the base model and eliminates truth inversion errors, demonstrating both higher accuracy and substantially greater safety.
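To make the two mechanisms concrete, the sketch below shows one plausible reading of the 4-quadrant taxonomy (statements crossed by knowledge grounding and contextual consistency) and of CoRFu's asymmetric DPO penalty. This is a minimal PyTorch sketch, not the paper's implementation: the quadrant labels, the names quadrant and corfu_dpo_loss, and the hyperparameters beta and unsafe_weight are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Assumed reading of the 4-quadrant framework: each generated statement is
# placed by (a) whether it is grounded in the UMLS-based knowledge base and
# (b) whether it is consistent with the patient's EHR context.
QUADRANTS = {
    (True, True): "knowledge-grounded & context-consistent",
    (True, False): "knowledge-grounded but context-inconsistent",
    (False, True): "ungrounded yet context-consistent",
    (False, False): "ungrounded & context-inconsistent",
}

def quadrant(knowledge_grounded: bool, context_consistent: bool) -> str:
    """Return the evaluation quadrant for one generated statement."""
    return QUADRANTS[(knowledge_grounded, context_consistent)]

def corfu_dpo_loss(pi_logp_chosen, pi_logp_rejected,
                   ref_logp_chosen, ref_logp_rejected,
                   unsafe_mask, beta=0.1, unsafe_weight=2.0):
    """Standard DPO loss with an asymmetric penalty (assumed form):
    preference pairs whose rejected answer is an unsafe confusion, e.g.
    hallucinated support or truth inversion, are up-weighted by
    `unsafe_weight` (> 1, an illustrative hyperparameter)."""
    # DPO margin between policy and frozen reference log-probabilities.
    margin = beta * ((pi_logp_chosen - ref_logp_chosen)
                     - (pi_logp_rejected - ref_logp_rejected))
    per_pair = -F.logsigmoid(margin)       # standard per-pair DPO loss
    weights = torch.where(unsafe_mask,     # heavier penalty on unsafe pairs
                          torch.full_like(per_pair, unsafe_weight),
                          torch.ones_like(per_pair))
    return (weights * per_pair).mean()

# Example: two preference pairs; the second rejected answer endorsed a
# counterfactual (hallucinated support), so it carries the larger weight.
loss = corfu_dpo_loss(
    pi_logp_chosen=torch.tensor([-1.0, -1.2]),
    pi_logp_rejected=torch.tensor([-2.0, -1.1]),
    ref_logp_chosen=torch.tensor([-1.1, -1.2]),
    ref_logp_rejected=torch.tensor([-1.9, -1.3]),
    unsafe_mask=torch.tensor([False, True]))
```

The asymmetric weight is the key design choice implied by the abstract: not all errors are equally harmful, so pairs whose dispreferred answer is a clinically unsafe confusion contribute more to the gradient than ordinary mistakes.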
Similar Papers
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
Computation and Language
Tests AI for doctor-level medical answers.
Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning
Artificial Intelligence
Helps medical AI models understand patient context better.
CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field
Computation and Language
Tests whether AI can critically appraise medical studies.