Diagnostic Accuracy of Open-Source Vision-Language Models on Diverse Medical Imaging Tasks
By: Gustav Müller-Franzes , Debora Jutz , Jakob Nikolas Kather and more
Potential Business Impact:
Helps computers find diseases in X-rays.
This retrospective study evaluated five VLMs (Qwen2.5, Phi-4, Gemma3, Llama3.2, and Mistral3.1) using the MedFMC dataset. This dataset includes 22,349 images from 7,461 patients encompassing chest radiography (19 disease multi-label classifications), colon pathology (tumor detection), endoscopy (colorectal lesion identification), neonatal jaundice assessment (skin color-based treatment necessity), and retinal fundoscopy (5-point diabetic retinopathy grading). Diagnostic accuracy was compared in three experimental settings: visual input only, multimodal input, and chain-of-thought reasoning. Model accuracy was assessed against ground truth labels, with statistical comparisons using bootstrapped confidence intervals (p<.05). Qwen2.5 achieved the highest accuracy for chest radiographs (90.4%) and endoscopy images (84.2%), significantly outperforming the other models (p<.001). In colon pathology, Qwen2.5 (69.0%) and Phi-4 (69.6%) performed comparably (p=.41), both significantly exceeding other VLMs (p<.001). Similarly, for neonatal jaundice assessment, Qwen2.5 (58.3%) and Phi-4 (58.1%) showed comparable leading accuracies (p=.93) significantly exceeding their counterparts (p<.001). All models struggled with retinal fundoscopy; Qwen2.5 and Gemma3 achieved the highest, albeit modest, accuracies at 18.6% (comparable, p=.99), significantly better than other tested models (p<.001). Unexpectedly, multimodal input reduced accuracy for some models and modalities, and chain-of-thought reasoning prompts also failed to improve accuracy. The open-source VLMs demonstrated promising diagnostic capabilities, particularly in chest radiograph interpretation. However, performance in complex domains such as retinal fundoscopy was limited, underscoring the need for further development and domain-specific adaptation before widespread clinical application.
Similar Papers
MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis
CV and Pattern Recognition
Helps doctors measure body parts from X-rays.
Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis
CV and Pattern Recognition
Helps doctors understand 3D body scans with words.
Comparative analysis of privacy-preserving open-source LLMs regarding extraction of diagnostic information from clinical CMR imaging reports
Computers and Society
Helps doctors quickly find heart problems from scans.