Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
By: Yixiong Chen, Wenjie Xiao, Pedro R. A. S. Bassi and more
Potential Business Impact:
Helps doctors find tumors in 3D scans.
Vision-Language Models (VLMs) have shown promise in various 2D visual tasks, yet their readiness for 3D clinical diagnosis remains unclear due to stringent demands for recognition precision, reasoning ability, and domain knowledge. To systematically evaluate these dimensions, we present DeepTumorVQA, a diagnostic visual question answering (VQA) benchmark targeting abdominal tumors in CT scans. It comprises 9,262 CT volumes (3.7M slices) from 17 public datasets, with 395K expert-level questions spanning four categories: Recognition, Measurement, Visual Reasoning, and Medical Reasoning. DeepTumorVQA introduces unique challenges, including small tumor detection and clinical reasoning across 3D anatomy. Benchmarking four advanced VLMs (RadFM, M3D, Merlin, CT-CHAT), we find that current models perform adequately on measurement tasks but struggle with lesion recognition and reasoning, and still fall short of clinical needs. Two key insights emerge: (1) large-scale multimodal pretraining plays a crucial role in performance on DeepTumorVQA, making RadFM stand out among all VLMs; (2) our dataset exposes critical differences in VLM components, where proper image preprocessing and the design of vision modules significantly affect 3D perception. To facilitate medical multimodal research, we have released DeepTumorVQA as a rigorous benchmark: https://github.com/Schuture/DeepTumorVQA.