Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering
By: Dosung Lee, Sangwon Jung, Boyoung Kim, and more
Potential Business Impact:
Teaches computers to answer questions about images better.
Existing Multimodal Knowledge-Based Visual Question Answering (MKB-VQA) benchmarks suffer from "visual shortcuts": the query image typically matches the primary subject entity of the target document. We demonstrate that models can exploit these shortcuts, achieving comparable results using visual cues alone. To address this, we introduce the Relational Entity Text-Image kNowledge Augmented (RETINA) benchmark, automatically constructed with an LLM-driven pipeline and consisting of a 120k training set and a 2k human-curated test set. RETINA contains queries that reference secondary subjects (i.e., related entities) and pairs them with images of these related entities, removing the visual shortcut. When evaluated on RETINA, existing models show significantly degraded performance, confirming their reliance on the shortcut. Furthermore, we propose the Multi-Image MultImodal Retriever (MIMIR), which enriches document embeddings by incorporating images of multiple related entities, effectively handling RETINA, unlike prior work that uses only a single image per document. Our experiments validate the limitations of existing benchmarks and demonstrate the effectiveness of RETINA and MIMIR. Our project is available at: Project Page.
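To make the retrieval idea concrete, here is a minimal sketch of how a document embedding could be enriched with images of multiple related entities, in the spirit of MIMIR. It assumes a CLIP-style dual encoder producing unit-normalized vectors; the function name, the mean-pooling fusion, and the alpha weight are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn.functional as F

def enrich_document_embedding(text_emb: torch.Tensor,
                              image_embs: torch.Tensor,
                              alpha: float = 0.5) -> torch.Tensor:
    """Fuse a document's text embedding with embeddings of multiple
    related-entity images (hypothetical fusion; not the paper's exact method).

    text_emb:   (d,)   unit-normalized text embedding of the document
    image_embs: (k, d) unit-normalized embeddings of k related-entity images
    alpha:      mixing weight between text and pooled image evidence
    """
    pooled_images = image_embs.mean(dim=0)               # aggregate all k images
    fused = alpha * text_emb + (1 - alpha) * pooled_images
    return F.normalize(fused, dim=-1)                    # keep on the unit sphere

# Toy usage with random CLIP-sized (512-d) vectors.
d, k = 512, 3
doc_text = F.normalize(torch.randn(d), dim=-1)
related_entity_images = F.normalize(torch.randn(k, d), dim=-1)
query_image = F.normalize(torch.randn(d), dim=-1)

doc_emb = enrich_document_embedding(doc_text, related_entity_images)
score = query_image @ doc_emb                            # cosine similarity for retrieval
print(score.item())
```

The point of pooling over all k related-entity images is that a single document embedding can then match query images of secondary subjects, whereas a document represented by one primary-subject image only matches that subject, which is exactly the shortcut RETINA removes.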
Similar Papers
Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning
CV and Pattern Recognition
Helps news captioning systems understand pictures better.
VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding
CV and Pattern Recognition
Helps computers find answers in documents in any language.
Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering
CV and Pattern Recognition
Helps computers answer questions using pictures and facts.