MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models
By: Kangkun Mao, Jinru Ding, Jiayuan Chen, and more
Potential Business Impact:
Helps AI doctors do math for patient care.
As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking the quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover only a few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs' medical calculation abilities, comprising 700+ tasks of two types: equation-based calculations (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this environment achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding. Code and datasets are available at https://github.com/maokangkun/MedCalc-Eval.
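To make the two task types concrete, here is a minimal Python sketch contrasting an equation-based calculation (Cockcroft-Gault creatinine clearance) with a rule-based score (Glasgow Coma Scale). The formulas are the standard published ones; the function names and example values are illustrative and not taken from the MedCalc-Eval codebase.

```python
# Illustrative examples of the two MedCalc-Eval task types.
# Formulas are the standard published ones; function names are hypothetical,
# not drawn from the MedCalc-Eval repository.

def cockcroft_gault(age_years: float, weight_kg: float,
                    serum_creatinine_mg_dl: float, is_female: bool) -> float:
    """Equation-based task: estimated creatinine clearance (mL/min)."""
    crcl = ((140 - age_years) * weight_kg) / (72 * serum_creatinine_mg_dl)
    return crcl * 0.85 if is_female else crcl

def glasgow_coma_scale(eye: int, verbal: int, motor: int) -> int:
    """Rule-based task: sum of categorical sub-scores (range 3-15)."""
    assert 1 <= eye <= 4 and 1 <= verbal <= 5 and 1 <= motor <= 6
    return eye + verbal + motor

# Example: a 60-year-old, 70 kg female with serum creatinine 1.2 mg/dL
print(round(cockcroft_gault(60, 70, 1.2, is_female=True), 1))  # ~55.1 mL/min
print(glasgow_coma_scale(eye=3, verbal=4, motor=5))            # 12
```

Equation-based tasks test whether a model selects the right formula and handles units and numeric precision, while rule-based tasks test mapping clinical findings to categorical sub-scores before aggregation.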
Similar Papers
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
Computation and Language
Tests AI for doctor-level medical answers.
From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations
Computation and Language
Makes AI better at medical math for doctors.
CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field
Computation and Language
Helps doctors check if medical studies are good.