MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis
By: Yongcheng Yao, Yongshuo Zong, Raman Dutt, and more
Potential Business Impact:
Helps doctors measure anatomical structures, tumors, and angles from medical images.
Current vision-language models (VLMs) in medicine are primarily designed for categorical question answering (e.g., "Is this normal or abnormal?") or qualitative descriptive tasks. However, clinical decision-making often relies on quantitative assessments, such as measuring the size of a tumor or the angle of a joint, from which physicians draw their own diagnostic conclusions. This quantitative reasoning capability remains underexplored and poorly supported in existing VLMs. In this work, we introduce MedVision, a large-scale dataset and benchmark specifically designed to evaluate and improve VLMs on quantitative medical image analysis. MedVision spans 22 public datasets covering diverse anatomies and modalities, with 30.8 million image-annotation pairs. We focus on three representative quantitative tasks: (1) detection of anatomical structures and abnormalities, (2) tumor/lesion (T/L) size estimation, and (3) angle/distance (A/D) measurement. Our benchmarks show that current off-the-shelf VLMs perform poorly on these tasks. With supervised fine-tuning on MedVision, however, we significantly improve their performance across detection, T/L size estimation, and A/D measurement, reducing error rates and improving precision. This work provides a foundation for developing VLMs with robust quantitative reasoning capabilities in medical imaging. Code and data are available at https://medvision-vlm.github.io.
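All three tasks are scored against numeric ground truth rather than class labels. As a rough illustration of what that entails (a minimal sketch; the exact MedVision metrics and box format are defined in the paper and repository, and the choices below are assumptions), detection can be scored with intersection-over-union and scalar measurements with relative error:

# Illustrative scoring of quantitative predictions (Python).
# Metric choices and the (x1, y1, x2, y2) box format are assumptions,
# not the official MedVision evaluation protocol.

def box_iou(pred, gt):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(pred) + area(gt) - inter
    return inter / union if union > 0 else 0.0

def relative_error(pred_value, gt_value):
    """Relative error of a scalar estimate (tumor diameter, joint angle, ...)."""
    return abs(pred_value - gt_value) / abs(gt_value)

# Example: one detection, one T/L size estimate (mm), one A/D measurement (degrees).
print(box_iou((10, 10, 50, 60), (12, 8, 48, 58)))  # ~0.83
print(relative_error(23.0, 25.0))                  # 0.08 (8% size error)
print(relative_error(132.0, 128.0))                # ~0.03 (3% angle error)

Higher IoU and lower relative error both indicate that a model's numeric outputs track the annotations more closely, which is the kind of improvement the abstract reports after fine-tuning.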
Similar Papers
MedM-VL: What Makes a Good Medical LVLM?
CV and Pattern Recognition
Helps doctors understand medical pictures better.
Vision Language Models in Medicine
CV and Pattern Recognition
Helps doctors understand medical images and notes.
Diagnostic Accuracy of Open-Source Vision-Language Models on Diverse Medical Imaging Tasks
Image and Video Processing
Helps computers find diseases in medical images.