VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine
By: Ziyang Zhang, Yang Yu, Xulei Yang, and more
Potential Business Impact:
Helps doctors understand 3D scans better.
Vision-and-language models (VLMs) have been increasingly explored in the medical domain, particularly following the success of CLIP in the general domain. However, unlike the relatively straightforward pairing of 2D images and text, curating large-scale paired data in the medical field for volumetric modalities such as CT scans remains a challenging and time-intensive process. This difficulty often limits performance on downstream tasks. To address these challenges, we propose a novel vision-language pre-training (VLP) framework, termed VELVET-Med, specifically designed for limited volumetric data such as 3D CT scans and associated radiology reports. Instead of relying on large-scale data collection, our method focuses on the development of effective pre-training objectives and model architectures. The key contributions are: 1) We incorporate uni-modal self-supervised learning into the VLP framework, a combination that is often underexplored in the existing literature. 2) We propose a novel language encoder, termed TriBERT, for learning multi-level textual semantics. 3) We devise hierarchical contrastive learning to capture multi-level vision-language correspondence. Using only 38,875 scan-report pairs, our approach seeks to uncover the rich spatial and semantic relationships embedded in volumetric medical images and their corresponding clinical narratives, thereby enhancing the generalization ability of the learned encoders. The resulting encoders exhibit strong transferability, achieving state-of-the-art performance across a wide range of downstream tasks, including 3D segmentation, cross-modal retrieval, visual question answering, and report generation.
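The abstract does not spell out the loss, but hierarchical contrastive learning presumably extends the standard CLIP-style InfoNCE objective across several paired granularities (e.g. whole scan vs. full report, sub-volume vs. sentence). The sketch below illustrates that general pattern; the function names, the choice of levels, and the per-level weighting are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(image_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric CLIP-style contrastive loss on a batch of paired embeddings.

    image_emb, text_emb: (B, D) tensors; matching rows are positive pairs,
    all other rows in the batch serve as negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

def hierarchical_contrastive_loss(vision_levels, text_levels, weights=None):
    """Hypothetical multi-level variant: apply InfoNCE at each paired
    granularity and take a weighted average. The pairing of levels and
    the weights are assumptions for illustration only."""
    weights = weights or [1.0] * len(vision_levels)
    losses = [w * info_nce(v, t)
              for w, (v, t) in zip(weights, zip(vision_levels, text_levels))]
    return sum(losses) / sum(weights)
```

For instance, with two granularities one might call `hierarchical_contrastive_loss([scan_emb, region_emb], [report_emb, sentence_emb])`, aligning scans with reports and sub-volumes with sentences in a single objective.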
Similar Papers
Comprehensive language-image pre-training for 3D medical image understanding
CV and Pattern Recognition
Helps doctors find disease in scans.
SCALE-VLP: Soft-Weighted Contrastive Volumetric Vision-Language Pre-training with Spatial-Knowledge Semantics
CV and Pattern Recognition
Helps doctors understand 3D body scans better.
More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era
CV and Pattern Recognition
AI reads X-rays and reports for better medical AI.