ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
By: Alberto Compagnoni, Marco Morini, Sara Sarto, and more
Potential Business Impact:
Helps AI systems answer hard, knowledge-intensive questions by retrieving external facts and reasoning over them before answering.
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: https://github.com/aimagelab/ReAG.
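For readers skimming the abstract, here is a minimal, self-contained sketch of the coarse-to-fine "retrieve, then filter" pipeline it describes. All names (`coarse_retrieve`, `fine_retrieve`, `critic_filter`) and the word-overlap scoring below are illustrative assumptions, not the released implementation: ReAG pairs multimodal retrievers with a learned critic model, and the actual code is in the linked repository.

```python
# Minimal sketch of the coarse-to-fine "retrieve, then filter" pipeline the
# abstract describes. Everything here is an illustrative assumption: the real
# ReAG uses multimodal retrievers and a learned critic model (see the repo).

from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str
    text: str
    score: float  # relevance score assigned at retrieval time


def _overlap(query: str, text: str) -> float:
    # Toy relevance score: word overlap stands in for embedding similarity.
    def tokens(s: str) -> set[str]:
        return set(s.lower().replace("?", " ").replace(".", " ").split())
    return float(len(tokens(query) & tokens(text)))


def coarse_retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    # Stage 1 (coarse): rank whole documents, keep the top-k candidates.
    return sorted(corpus, key=lambda d: _overlap(query, corpus[d]), reverse=True)[:k]


def fine_retrieve(query: str, doc_ids: list[str], corpus: dict[str, str],
                  k: int = 3) -> list[Passage]:
    # Stage 2 (fine): split candidate documents into passages and re-rank them.
    passages = [Passage(d, sent, _overlap(query, sent))
                for d in doc_ids for sent in corpus[d].split(". ")]
    return sorted(passages, key=lambda p: p.score, reverse=True)[:k]


def critic_filter(passages: list[Passage], threshold: float = 2.0) -> list[Passage]:
    # Critic stage: discard passages judged irrelevant. A trained critic model
    # would replace this naive score threshold.
    return [p for p in passages if p.score >= threshold]


if __name__ == "__main__":
    corpus = {
        "doc_a": "The Eiffel Tower is in Paris. It was completed in 1889.",
        "doc_b": "The Colosseum is in Rome. It hosted gladiator games.",
    }
    query = "When was the Eiffel Tower completed?"
    context = critic_filter(fine_retrieve(query, coarse_retrieve(query, corpus), corpus))
    for p in context:  # these passages would condition the MLLM's answer
        print(p.doc_id, "->", p.text)
```

The design point the abstract emphasizes is the critic stage: filtering retrieved passages before generation, so the model reasons only over high-quality context rather than noisy retrieval output.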
Similar Papers
mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering
CV and Pattern Recognition
Uses multimodal knowledge graphs to help models retrieve better facts for answering questions about images.
KERAG: Knowledge-Enhanced Retrieval-Augmented Generation for Advanced Question Answering
Computation and Language
Enriches retrieval-augmented generation with structured knowledge so AI can answer questions more accurately.
Multimodal Iterative RAG for Knowledge Visual Question Answering
CV and Pattern Recognition
Retrieves evidence over multiple iterations so models can answer harder, knowledge-intensive visual questions.