MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models
By: Kangkun Mao, Jinru Ding, Jiayuan Chen, and more
Potential Business Impact:
Helps AI doctors do math for patient care.
As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking the quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover only a few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs' medical calculation abilities, comprising 700+ tasks of two types: equation-based calculations (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this environment achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding. Code and datasets are available at https://github.com/maokangkun/MedCalc-Eval.
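To make the two task types concrete, here is a minimal Python sketch contrasting an equation-based calculation (Cockcroft-Gault creatinine clearance) with a rule-based score (Glasgow Coma Scale). The formulas are the standard published ones; the function names and example values are illustrative and not taken from the MedCalc-Eval codebase.

```python
# Illustrative examples of the two MedCalc-Eval task types.
# Formulas are the standard published ones; function names are hypothetical,
# not drawn from the MedCalc-Eval repository.

def cockcroft_gault(age_years: float, weight_kg: float,
                    serum_creatinine_mg_dl: float, is_female: bool) -> float:
    """Equation-based task: estimated creatinine clearance (mL/min)."""
    crcl = ((140 - age_years) * weight_kg) / (72 * serum_creatinine_mg_dl)
    return crcl * 0.85 if is_female else crcl

def glasgow_coma_scale(eye: int, verbal: int, motor: int) -> int:
    """Rule-based task: sum of categorical sub-scores (range 3-15)."""
    assert 1 <= eye <= 4 and 1 <= verbal <= 5 and 1 <= motor <= 6
    return eye + verbal + motor

# Example: a 60-year-old, 70 kg female with serum creatinine 1.2 mg/dL
print(round(cockcroft_gault(60, 70, 1.2, is_female=True), 1))  # ~55.1 mL/min
print(glasgow_coma_scale(eye=3, verbal=4, motor=5))            # 12
```

Equation-based tasks test whether a model selects the right formula and handles units and numeric precision, while rule-based tasks test mapping clinical findings to categorical sub-scores before aggregation.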
Similar Papers
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
Computation and Language
Tests AI for doctor-level medical answers.
From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations
Computation and Language
Makes AI better at medical math for doctors.
CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field
Computation and Language
Helps doctors check if medical studies are good.