Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering
By: Jiaqi Deng, Kaize Shi, Zonghan Wu and more
Potential Business Impact:
Helps computers answer questions using pictures and facts.
Knowledge-based Vision Question Answering (KB-VQA) systems address complex, visually grounded questions with knowledge retrieved from external knowledge bases. Knowledge retrieval and answer generation both require precise multimodal understanding of the question context and the external knowledge. However, existing methods treat these two stages as separate modules with limited interaction during training, which hinders bi-directional parametric knowledge sharing and ultimately leads to suboptimal performance. To fully exploit the cross-task synergy in KB-VQA, we propose a unified retrieval-augmented VQA framework with collaborative parametric knowledge calibration. The proposed framework effectively adapts general multimodal pre-trained models to fine-grained, knowledge-intensive tasks while enabling the retriever and generator to collaboratively enhance and share their parametric knowledge during both training and inference. To enhance fine-grained understanding of questions and external documents, we also integrate a late interaction mechanism into the proposed training framework. Additionally, we introduce a reflective-answering mechanism that allows the model to explicitly evaluate and refine its knowledge boundary. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7% improvement in answering accuracy, and brings an average 7.5% boost in base MLLMs' VQA performance.
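The abstract does not say which form of late interaction is used. As a rough illustration only, the sketch below assumes ColBERT-style MaxSim scoring: each query token embedding is matched to its most similar document token embedding, and the per-token maxima are summed. The function names and the toy two-dimensional embeddings are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def late_interaction_score(query_tokens, doc_tokens):
    """MaxSim relevance score (assumed ColBERT-style, not confirmed by the paper).

    query_tokens, doc_tokens: lists of token embeddings (lists of floats).
    For every query token, take the highest cosine similarity to any
    document token, then sum those maxima over the query tokens.
    """
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    score = 0.0
    for qv in q:
        score += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return score


def rank_documents(query_tokens, docs):
    """Return document indices sorted from most to least relevant."""
    scores = [late_interaction_score(query_tokens, doc) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy example: two query tokens, two candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0]]              # covers only the first query token
ranking = rank_documents(query, [doc_a, doc_b])
```

Because scoring happens per token rather than on a single pooled vector, a document that covers every part of the question (here `doc_a`) ranks above one that matches only part of it.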
Similar Papers
A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task
CV and Pattern Recognition
Helps computers answer questions using pictures and facts.
Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering
CV and Pattern Recognition
Helps computers answer questions using pictures and facts.
A Knowledge Noise Mitigation Framework for Knowledge-based Visual Question Answering
CV and Pattern Recognition
Helps computers answer questions by focusing on useful facts.