From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature
By: Kun Yuan, Min Woo Sun, Zhen Chen and more
Potential Business Impact:
Helps doctors understand medical pictures better.
There is growing interest in developing strong biomedical vision-language models. A popular approach to achieving robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts it into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchically aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives, from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small set of literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.
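To make the figure, panel, and patch hierarchy concrete, the following is a minimal sketch of how one parsed figure might be represented and flattened into multi-granular image-text pairs. It is not the Panel2Patch implementation: the class names, the `flatten_pairs` helper, the bounding-box convention, and the example captions are all hypothetical and chosen only for illustration. The parsing steps themselves (layout, panel, and marker detection) are assumed to have already run.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical convention: (x0, y0, x1, y1) in figure pixel coordinates.
BBox = Tuple[int, int, int, int]


@dataclass
class Patch:
    """A marker-indicated region inside a panel, with its short phrase."""
    bbox: BBox
    phrase: str


@dataclass
class Panel:
    """One sub-figure of a multi-panel figure, with its sub-caption."""
    bbox: BBox
    caption: str
    patches: List[Patch] = field(default_factory=list)


@dataclass
class Figure:
    """A whole figure with its full caption and its parsed panels."""
    bbox: BBox
    caption: str
    panels: List[Panel] = field(default_factory=list)


def flatten_pairs(figure: Figure) -> List[Tuple[str, BBox, str]]:
    """Turn one parsed figure into (level, region, text) pairs.

    Each figure yields one figure-level pair, one pair per panel,
    and one pair per patch, instead of a single figure-level sample.
    """
    pairs = [("figure", figure.bbox, figure.caption)]
    for panel in figure.panels:
        pairs.append(("panel", panel.bbox, panel.caption))
        for patch in panel.patches:
            pairs.append(("patch", patch.bbox, patch.phrase))
    return pairs


# Illustrative input only; the captions and coordinates are made up.
example = Figure(
    bbox=(0, 0, 800, 400),
    caption="Two-panel figure with an annotated region in panel A.",
    panels=[
        Panel(
            bbox=(0, 0, 400, 400),
            caption="(A) Panel with an arrow marking a region.",
            patches=[Patch(bbox=(120, 150, 200, 230), phrase="region marked by arrow")],
        ),
        Panel(bbox=(400, 0, 800, 400), caption="(B) Panel without markers."),
    ],
)

for level, bbox, text in flatten_pairs(example):
    print(level, bbox, text)
```

In this sketch a single two-panel figure produces four training pairs rather than one, which is the sense in which hierarchical parsing can extract more supervision from the same set of figures. A granularity-aware objective could then use the `level` tag to treat figure-, panel-, and patch-level text differently; how that is done in the paper is not specified in this abstract.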
Similar Papers
A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text?
CV and Pattern Recognition
Helps doctors understand X-rays better with new AI.
Comprehensive language-image pre-training for 3D medical image understanding
CV and Pattern Recognition
Helps doctors find sickness in scans.
Distribution-Based Masked Medical Vision-Language Model Using Structured Reports
CV and Pattern Recognition
Helps doctors understand X-rays better by reading reports.