Alignment, Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering
By: Yuanhao Zou, Zhaozheng Yin
Potential Business Impact:
Helps doctors answer questions about medical pictures.
Medical Visual Question Answering (Med-VQA) is a challenging task that requires a deep understanding of both medical images and textual questions. Although recent works leveraging Medical Vision-Language Pre-training (Med-VLP) have shown strong performance on Med-VQA, there is still no unified solution for modality alignment, and the issue of hard negatives remains under-explored. Additionally, knowledge fusion techniques commonly used for Med-VQA may introduce irrelevant information. In this work, we propose a framework that addresses these challenges through three key contributions: (1) a unified solution for heterogeneous modality alignment across multiple levels, modalities, views, and stages, leveraging methods such as contrastive learning and optimal transport; (2) a hard negative mining method that employs soft labels for multi-modality alignment and enforces discrimination of hard negative pairs; and (3) a Gated Cross-Attention Module for Med-VQA that integrates the answer vocabulary as prior knowledge and selects relevant information from it. Our framework outperforms the previous state-of-the-art on widely used Med-VQA datasets, including VQA-RAD, SLAKE, PathVQA and VQA-2019.
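To make contribution (2) concrete, below is a minimal PyTorch sketch of in-batch image-text contrastive alignment with soft labels: instead of strict one-hot targets, some target mass is shifted toward the most similar non-matching pairs, so the model is explicitly pushed to discriminate hard negatives. The function name, temperature values, and the intra-modal-similarity construction of the soft labels are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_label_contrastive_loss(img_emb, txt_emb, tau=0.07, label_tau=0.2, alpha=0.2):
    """Hypothetical in-batch contrastive loss with soft labels for hard negatives.

    img_emb, txt_emb: (B, D) paired image/text embeddings.
    tau:       temperature for the image-text logits.
    label_tau: temperature for the intra-modal similarities used as soft labels.
    alpha:     how much target mass is moved from the one-hot positive onto
               semantically close (i.e., hard) negatives. All values are guesses.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau                      # (B, B) image-to-text logits
    one_hot = torch.eye(logits.size(0), device=logits.device)

    # Soft labels from intra-modal similarity: a text very similar to another
    # sample's text marks that sample as a hard (or even false) negative.
    with torch.no_grad():
        soft_i2t = F.softmax(txt @ txt.t() / label_tau, dim=1)
        soft_t2i = F.softmax(img @ img.t() / label_tau, dim=1)
    tgt_i2t = (1 - alpha) * one_hot + alpha * soft_i2t
    tgt_t2i = (1 - alpha) * one_hot + alpha * soft_t2i

    # Soft cross-entropy in both retrieval directions.
    loss_i2t = -(tgt_i2t * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(tgt_t2i * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```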
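For the optimal-transport side of contribution (1), a standard way to align fine-grained tokens across modalities is entropic OT solved with Sinkhorn iterations. The sketch below, assuming uniform marginals and illustrative hyperparameters, computes a transport plan between image-patch and question-token embeddings; whether the paper uses exactly this solver is an assumption.

```python
import torch

def sinkhorn_plan(cost, eps=0.05, n_iters=50):
    """Entropic optimal transport via Sinkhorn-Knopp (uniform marginals assumed).

    cost: (N, M) cost matrix, e.g. 1 - cosine similarity between N image-patch
          embeddings and M question-token embeddings.
    Returns a transport plan T of shape (N, M) whose rows/columns sum to the
    uniform marginals; (T * cost).sum() can then serve as an alignment loss.
    """
    K = torch.exp(-cost / eps)                                   # Gibbs kernel
    a = torch.full((cost.size(0),), 1.0 / cost.size(0), device=cost.device)
    b = torch.full((cost.size(1),), 1.0 / cost.size(1), device=cost.device)
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)                                      # column scaling
        u = a / (K @ v)                                          # row scaling
    return u.unsqueeze(1) * K * v.unsqueeze(0)                   # diag(u) K diag(v)
```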
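Contribution (3) can be pictured as cross-attention over answer-vocabulary embeddings followed by a learned gate that filters out irrelevant retrieved knowledge. The module below is a hedged sketch of that idea; the dimensions, sigmoid-gate design, and residual placement are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Hypothetical gated cross-attention over answer-vocabulary embeddings."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, fused, answer_vocab):
        # fused:        (B, N, D) fused image-question features (queries)
        # answer_vocab: (1 or B, V, D) embeddings of the answer vocabulary,
        #               acting as prior knowledge (keys/values)
        if answer_vocab.size(0) == 1:
            answer_vocab = answer_vocab.expand(fused.size(0), -1, -1)
        knowledge, _ = self.attn(fused, answer_vocab, answer_vocab)
        # An element-wise gate decides how much retrieved knowledge to admit,
        # suppressing vocabulary entries irrelevant to the current question.
        g = self.gate(torch.cat([fused, knowledge], dim=-1))
        return self.norm(fused + g * knowledge)
```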
Similar Papers
MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning
Artificial Intelligence
Helps doctors understand medical images better.
Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion
CV and Pattern Recognition
Helps doctors answer tricky questions about medical pictures.
Structure Causal Models and LLMs Integration in Medical Visual Question Answering
CV and Pattern Recognition
Helps doctors answer questions from medical pictures.