FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
By: Jifeng Song, Arun Das, Pan Wang and more
Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or only provide figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy combining supervised learning with reinforcement learning (RL), utilizing CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Experimental results demonstrate that FigEx2 achieves 0.726 mAP@0.5:0.95 for detection and significantly outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Notably, FigEx2 exhibits remarkable zero-shot transferability to out-of-distribution scientific domains without any fine-tuning.
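For readers unfamiliar with the detection metric, the notation mAP@0.5:0.95 conventionally refers to average precision averaged over ten intersection-over-union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below is not the paper's evaluation code and does not compute mAP itself; it only illustrates, for one hypothetical predicted panel box and one hypothetical ground-truth box, how IoU is computed and which thresholds the prediction would satisfy. The function names and the box coordinates are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds implied by "@0.5:0.95": 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Return the IoU thresholds at which pred_box counts as a match."""
    score = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if score >= t]


# Hypothetical panel boxes in pixel coordinates (illustrative values only).
predicted_panel = (0, 0, 100, 100)
ground_truth_panel = (0, 0, 100, 78)

panel_iou = iou(predicted_panel, ground_truth_panel)
passed = thresholds_passed(predicted_panel, ground_truth_panel)
print(f"IoU = {panel_iou:.2f}, matched at thresholds: {passed}")
```

A prediction that overlaps its target only loosely counts as correct at the 0.50 threshold but fails the stricter ones, which is why a score averaged over the full range rewards tight panel localization.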