Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs

Published: September 22, 2025 | arXiv ID: 2509.18015v1

By: Advait Gosai, Arun Kavishwar, Stephanie L. McNamara, and more

BigTech Affiliations: University of California, Berkeley

Potential Business Impact:

AI can find sickness in X-rays.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Recent work has shown promising performance of frontier large language models (LLMs) and their multimodal counterparts in medical quizzes and diagnostic tasks, highlighting their potential for broad clinical utility given their accessible, general-purpose nature. However, beyond diagnosis, a fundamental aspect of medical image interpretation is the ability to localize pathological findings. Evaluating localization not only has clinical and educational relevance but also provides insight into a model's spatial understanding of anatomy and disease. Here, we systematically assess two general-purpose MLLMs (GPT-4 and GPT-5) and a domain-specific model (MedGemma) in their ability to localize pathologies on chest radiographs, using a prompting pipeline that overlays a spatial grid and elicits coordinate-based predictions. Averaged across nine pathologies in the CheXlocalize dataset, GPT-5 exhibited a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all lower than a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). Despite modest performance, error analysis revealed that GPT-5's predictions were largely in anatomically plausible regions, just not always precisely localized. GPT-4 performed well on pathologies with fixed anatomical locations, but struggled with spatially variable findings and exhibited anatomically implausible predictions more frequently. MedGemma demonstrated the lowest performance on all pathologies, showing limited capacity to generalize to this novel task. Our findings highlight both the promise and limitations of current MLLMs in medical imaging and underscore the importance of integrating them with task-specific tools for reliable use.
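The abstract mentions a pipeline that overlays a spatial grid and elicits coordinate-based predictions, then reports a localization accuracy. As a rough illustration only, the sketch below shows one way such a prediction could be scored: a predicted grid cell is mapped to its center pixel and counted as a hit if that pixel lies inside the annotated region. The grid size, the cell-center mapping, the hit criterion, and all function names here are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence, Tuple

# Hypothetical mask format: mask[y][x] == 1 where the pathology is annotated.
Mask = Sequence[Sequence[int]]


def cell_center(col: int, row: int, grid_size: int,
                width: int, height: int) -> Tuple[int, int]:
    """Map a grid cell (col, row) to the pixel at its center."""
    if not (0 <= col < grid_size and 0 <= row < grid_size):
        raise ValueError("grid cell out of range")
    x = int((col + 0.5) * width / grid_size)
    y = int((row + 0.5) * height / grid_size)
    return x, y


def is_hit(point: Tuple[int, int], mask: Mask) -> bool:
    """Return True if the pixel falls inside the annotated region."""
    x, y = point
    return bool(mask[y][x])


def localization_accuracy(preds: Sequence[Tuple[int, int]],
                          masks: Sequence[Mask],
                          grid_size: int) -> float:
    """Fraction of predicted cells whose center lands on the annotation.

    This hit criterion is an assumption; the paper may define
    accuracy differently.
    """
    if not preds or len(preds) != len(masks):
        raise ValueError("need one non-empty prediction per mask")
    hits = 0
    for (col, row), mask in zip(preds, masks):
        height, width = len(mask), len(mask[0])
        hits += is_hit(cell_center(col, row, grid_size, width, height), mask)
    return hits / len(preds)


# Toy 4x4 image with the finding in the top-left quadrant, scored on a 2x2 grid.
toy_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
accuracy = localization_accuracy([(0, 0), (1, 1)], [toy_mask, toy_mask], grid_size=2)
```

In this toy case the first prediction lands on the annotated quadrant and the second does not, so the accuracy is 0.5. A coarser grid makes the task easier for the model to express but lowers the spatial precision that a hit implies.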

ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays

CV and Pattern Recognition

Helps doctors find sickness on X-rays faster.

4 Jul 2025

91%

MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images

CV and Pattern Recognition

AI finds diseases in scans better than GPT-4.

29 Dec 2025

90%

Cancer Type, Stage and Prognosis Assessment from Pathology Reports using LLMs

Computation and Language

Helps doctors find cancer details from reports.

3 Mar 2025

Country of Origin

🇺🇸 United States

Page Count

13 pages
