MedBench-IT: A Comprehensive Benchmark for Evaluating Large Language Models on Italian Medical Entrance Examinations

Published: September 8, 2025 | arXiv ID: 2509.07135v1

By: Ruggero Marino Lazzaroni, Alessandro Angioi, Michelangelo Puliga and more

Potential Business Impact:

Tests AI models on Italian medical school entrance exams.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large language models (LLMs) show increasing potential in education, yet benchmarks for non-English languages in specialized domains remain scarce. We introduce MedBench-IT, the first comprehensive benchmark for evaluating LLMs on Italian medical university entrance examinations. Sourced from Edizioni Simone, a leading publisher of preparatory materials, MedBench-IT comprises 17,410 expert-written multiple-choice questions across six subjects (Biology, Chemistry, Logic, General Culture, Mathematics, Physics) and three difficulty levels. We evaluated diverse models, including proprietary LLMs (GPT-4o, Claude series) and resource-efficient open-source alternatives (<30B parameters), with a focus on practical deployability. Beyond accuracy, we conducted rigorous reproducibility tests (88.86% response consistency, varying by subject), an ordering bias analysis (minimal impact), and a reasoning prompt evaluation. We also examined correlations between question readability and model performance, finding a statistically significant but small inverse relationship. MedBench-IT provides a crucial resource for the Italian NLP community, EdTech developers, and practitioners, offering insights into current capabilities and a standardized evaluation methodology for this critical domain.
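The abstract does not say how response consistency or ordering bias were measured, so the following Python sketch is only one plausible reading. It assumes consistency means the share of questions that receive the same answer in every repeated run, and that ordering bias is probed by shuffling the answer options while tracking where the correct one lands. The function names `response_consistency` and `shuffle_options`, and the question-id-to-answer dictionary layout, are hypothetical and not taken from the paper.

```python
import random
from typing import Dict, List, Sequence, Tuple


def response_consistency(runs: Sequence[Dict[str, str]]) -> float:
    """Fraction of questions answered identically in every run.

    Each run maps a question id to the option label the model chose.
    ASSUMPTION: this definition is illustrative; the paper may use another.
    """
    if not runs:
        raise ValueError("need at least one run")
    # Only score questions that were answered in every run.
    shared_ids = set(runs[0]).intersection(*(set(r) for r in runs[1:]))
    if not shared_ids:
        raise ValueError("runs share no question ids")
    stable = sum(1 for qid in shared_ids if len({r[qid] for r in runs}) == 1)
    return stable / len(shared_ids)


def shuffle_options(
    options: List[str], correct_index: int, rng: random.Random
) -> Tuple[List[str], int]:
    """Return the options in a random order plus the new index of the correct one.

    ASSUMPTION: a simple permutation test; the paper's protocol is not specified.
    """
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # order[k] is the original index now shown at position k.
    return shuffled, order.index(correct_index)


if __name__ == "__main__":
    # Two hypothetical runs over three questions; q2 changes between runs.
    runs = [
        {"q1": "A", "q2": "B", "q3": "C"},
        {"q1": "A", "q2": "D", "q3": "C"},
    ]
    print(f"consistency: {response_consistency(runs):.2%}")

    shuffled, new_index = shuffle_options(
        ["mitosi", "meiosi", "osmosi", "fagocitosi", "apoptosi"], 2, random.Random(0)
    )
    print(shuffled, "correct option is now at index", new_index)
```

Comparing accuracy on the original and shuffled orderings would then give a rough estimate of how much option position influences a model's answers.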

MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents

Computation and Language

Tests AI to see if it's safe for doctors.

18 Nov 2025 0

91%

LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation

Computation and Language

Tests AI for doctor-level medical answers.

4 Jun 2025 1

Repos / Data Links

github.com

Page Count

12 pages
