Boosting Medical Visual Understanding From Multi-Granular Language Learning
By: Zihan Li, Yiqing Wang, Sina Farsiu and more
Potential Business Impact:
Helps doctors understand many medical images better.
Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
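The abstract names two ingredients, soft-label supervision over multiple matching texts and a smooth KL term for cross-granularity consistency, without giving their exact form. The sketch below is one plausible reading, not the authors' implementation: the function names, the temperature of 0.07, the uniform-mixing smoothing, the KL direction, and `consistency_weight` are all illustrative assumptions, and it assumes both granularities score the same number of candidate texts per image.

```python
import math

def softmax(row, temperature=0.07):
    """Turn one image's similarity scores into a distribution over texts."""
    scaled = [v / temperature for v in row]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(sims, matches, temperature=0.07):
    """Cross-entropy against soft targets (assumed form).

    Each row of `matches` is multi-hot: 1 where the text describes the image.
    The target spreads probability mass evenly over all matching texts,
    instead of the single positive used in standard CLIP training.
    """
    loss = 0.0
    for sim_row, match_row in zip(sims, matches):
        probs = softmax(sim_row, temperature)
        n_pos = sum(match_row)
        targets = [m / n_pos for m in match_row]
        loss -= sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)
    return loss / len(sims)

def smooth_kl(p, q, eps=1e-6):
    """KL(p || q) after mixing both with a uniform distribution.

    The mixing keeps every probability strictly positive, so no term
    evaluates log(0). This smoothing scheme is an assumption.
    """
    n = len(p)
    p_s = [(1 - eps) * x + eps / n for x in p]
    q_s = [(1 - eps) * x + eps / n for x in q]
    return sum(a * math.log(a / b) for a, b in zip(p_s, q_s))

def multi_granular_loss(sims_coarse, sims_fine, matches,
                        consistency_weight=1.0, temperature=0.07):
    """Soft-label loss per granularity plus a cross-granularity KL term."""
    align = (soft_label_loss(sims_coarse, matches, temperature)
             + soft_label_loss(sims_fine, matches, temperature))
    kl = 0.0
    for row_c, row_f in zip(sims_coarse, sims_fine):
        kl += smooth_kl(softmax(row_c, temperature), softmax(row_f, temperature))
    return align + consistency_weight * kl / len(sims_coarse)

# Toy data: 2 images, 3 candidate texts. Image 0 matches texts 0 and 1.
sims_coarse = [[0.9, 0.8, 0.1], [0.2, 0.1, 0.7]]
sims_fine = [[0.8, 0.9, 0.2], [0.1, 0.2, 0.8]]
matches = [[1, 1, 0], [0, 0, 1]]

total = multi_granular_loss(sims_coarse, sims_fine, matches)
print(f"toy loss: {total:.4f}")
```

In this reading, the soft targets handle the multi-label case (an image with two applicable disease labels is not penalized for scoring both highly), while the KL term penalizes an image whose text ranking differs between the coarse and fine descriptions.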
Similar Papers
PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning
CV and Pattern Recognition
Helps computers understand images and long text better.
Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset
CV and Pattern Recognition
Helps doctors find sickness in X-rays by words.
MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification
CV and Pattern Recognition
Helps doctors find diseases in pictures faster.