Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering
By: Federico Felizzi, Olivia Riccomi, Michele Ferramola, and more
Potential Business Impact:
AI models sometimes answer medical image questions without actually looking at the image.
Large vision-language models (VLMs) have achieved impressive performance on medical visual question answering benchmarks, yet their reliance on visual information remains unclear. We investigate whether frontier VLMs demonstrate genuine visual grounding when answering Italian medical questions by testing four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 Flash Exp. Using 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation, we substitute the correct medical images with blank placeholders to test whether models truly integrate visual and textual information. Our results reveal striking variability in visual dependency: GPT-4o shows the strongest visual grounding, with a 27.9 pp accuracy drop (from 83.2% [74.6%, 91.7%] to 55.3% [44.1%, 66.6%]), while GPT-5-mini, Gemini, and Claude maintain high accuracy with modest drops of 8.5 pp, 2.4 pp, and 5.6 pp, respectively. Analysis of model-generated reasoning reveals confident explanations for fabricated visual interpretations across all models, suggesting varying degrees of reliance on textual shortcuts versus genuine visual analysis. These findings highlight critical differences in model robustness and the need for rigorous evaluation before clinical deployment.
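To make the ablation concrete, here is a minimal Python sketch of the blank-image substitution: the same question is sent to a model once with the item's real image and once with a solid-white placeholder, and the accuracy gap over the full question set gives the percentage-point drop reported above. It assumes the official OpenAI Python SDK and Pillow; the question text, `real_image_url`, and scoring comment are hypothetical stand-ins, not the paper's actual evaluation harness.

```python
import base64
import io

from openai import OpenAI  # official OpenAI Python SDK (assumed installed)
from PIL import Image      # Pillow (assumed installed)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def blank_placeholder(size: int = 512) -> str:
    """Return a solid-white PNG as a base64 data URL, standing in for the real image."""
    img = Image.new("RGB", (size, size), color="white")
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode("ascii")


def ask(question: str, image_data_url: str, model: str = "gpt-4o") -> str:
    """Send one question plus one image to the model; return its answer text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_data_url}},
            ],
        }],
    )
    return response.choices[0].message.content


# Hypothetical stand-in for one EuropeMedQA item; the real dataset items and
# answer key are not reproduced here.
question = "Which finding is visible in the image? A) ... B) ... C) ... D) ..."
real_image_url = "data:image/png;base64,..."  # placeholder for the item's actual image

with_image = ask(question, real_image_url)
without_image = ask(question, blank_placeholder())

# Over the full 60-question set, the accuracy gap between the two conditions
# estimates visual dependency, in percentage points:
#   drop_pp = 100 * (accuracy_with_image - accuracy_blank)
```

A model that genuinely grounds its answer in the image should see `without_image` accuracy fall toward chance; a model leaning on textual shortcuts will show only a small drop, as the abstract reports for GPT-5-mini, Gemini, and Claude.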
Similar Papers
Benchmarking Visual Language Models on Standardized Visualization Literacy Tests
Human-Computer Interaction
Helps computers understand charts, but they still get tricked.
Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images
CV and Pattern Recognition
Helps doctors understand where body parts are.