CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field
By: Doria Bonzi, Alexandre Guiggi, Frédéric Béchet, and more
Potential Business Impact:
Helps AI understand medical research better.
Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) can offer promising support for this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5, even though generating intermediate reasoning tokens considerably improves results. Models remain especially challenged by questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal.
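The benchmark scores models with an Exact Match Rate. As an illustration only, below is a minimal Python sketch of how such a metric might be computed for multiple-answer exam questions; the function name, the set-of-letters answer format, and the example values are assumptions for illustration, not the paper's actual scoring code.

```python
from typing import Iterable, List, Set


def exact_match_rate(predictions: Iterable[Set[str]], references: Iterable[Set[str]]) -> float:
    """Fraction of questions whose predicted answer set exactly equals the gold set.

    Assumes each question's answer is a set of option letters (e.g. {"A", "C"});
    this format is an illustrative assumption, not the dataset's official schema.
    """
    preds: List[Set[str]] = list(predictions)
    refs: List[Set[str]] = list(references)
    if len(preds) != len(refs):
        raise ValueError("predictions and references must have the same length")
    if not refs:
        return 0.0
    hits = sum(1 for p, r in zip(preds, refs) if p == r)
    return hits / len(refs)


# Hypothetical example: three questions, two answered with exactly the gold option set.
print(exact_match_rate(
    [{"A", "C"}, {"B"}, {"D"}],
    [{"A", "C"}, {"B"}, {"C", "D"}],
))  # -> 0.666...
```

Under this strict scoring, a partially correct answer set counts as a miss, which is one plausible reason reported scores stay below 0.5.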
Similar Papers
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
Computation and Language
Tests AI for doctor-level medical answers.
MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports
Computation and Language
Helps AI doctors explain their thinking better.