Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering
By: Federico Felizzi, Olivia Riccomi, Michele Ferramola, and more
Potential Business Impact:
AI models sometimes answer medical image questions without actually looking at the image.
Large vision-language models (VLMs) have achieved impressive performance on medical visual question answering benchmarks, yet their reliance on visual information remains unclear. We investigate whether frontier VLMs demonstrate genuine visual grounding when answering Italian medical questions by testing four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 Flash Exp. Using 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation, we substitute the correct medical images with blank placeholders to test whether models truly integrate visual and textual information. Our results reveal striking variability in visual dependency: GPT-4o shows the strongest visual grounding, with a 27.9 pp accuracy drop (from 83.2% [74.6%, 91.7%] to 55.3% [44.1%, 66.6%]), while GPT-5-mini, Gemini, and Claude maintain high accuracy with modest drops of 8.5 pp, 2.4 pp, and 5.6 pp, respectively. Analysis of model-generated reasoning reveals confident explanations for fabricated visual interpretations across all models, suggesting varying degrees of reliance on textual shortcuts versus genuine visual analysis. These findings highlight critical differences in model robustness and the need for rigorous evaluation before clinical deployment.
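To make the ablation concrete, here is a minimal Python sketch of the blank-image substitution: the same question is sent to a model once with the item's real image and once with a solid-white placeholder, and the accuracy gap over the full question set gives the percentage-point drop reported above. It assumes the official OpenAI Python SDK and Pillow; the question text, `real_image_url`, and scoring comment are hypothetical stand-ins, not the paper's actual evaluation harness.

```python
import base64
import io

from openai import OpenAI  # official OpenAI Python SDK (assumed installed)
from PIL import Image      # Pillow (assumed installed)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def blank_placeholder(size: int = 512) -> str:
    """Return a solid-white PNG as a base64 data URL, standing in for the real image."""
    img = Image.new("RGB", (size, size), color="white")
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode("ascii")


def ask(question: str, image_data_url: str, model: str = "gpt-4o") -> str:
    """Send one question plus one image to the model; return its answer text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_data_url}},
            ],
        }],
    )
    return response.choices[0].message.content


# Hypothetical stand-in for one EuropeMedQA item; the real dataset items and
# answer key are not reproduced here.
question = "Which finding is visible in the image? A) ... B) ... C) ... D) ..."
real_image_url = "data:image/png;base64,..."  # placeholder for the item's actual image

with_image = ask(question, real_image_url)
without_image = ask(question, blank_placeholder())

# Over the full 60-question set, the accuracy gap between the two conditions
# estimates visual dependency, in percentage points:
#   drop_pp = 100 * (accuracy_with_image - accuracy_blank)
```

A model that genuinely grounds its answer in the image should see `without_image` accuracy fall toward chance; a model leaning on textual shortcuts will show only a small drop, as the abstract reports for GPT-5-mini, Gemini, and Claude.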
Similar Papers
Benchmarking Visual Language Models on Standardized Visualization Literacy Tests
Human-Computer Interaction
Helps computers understand charts, but they still get tricked.
Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images
CV and Pattern Recognition
Helps doctors understand where body parts are.