Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering

Published: May 22, 2025 | arXiv ID: 2505.16470v1

By: Kuicai Dong, Yujing Chang, Shijie Huang and more

BigTech Affiliations: Huawei

Potential Business Impact:

Helps computers understand documents with text and pictures.

Business Areas:

Augmented Reality Hardware, Software

Document Visual Question Answering (DocVQA) faces two challenges: processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches and frequently miss critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces new metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLM/LLM models and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Key findings reveal that advanced proprietary LVMs outperform open-source alternatives. Proprietary models also show moderate advantages when using multimodal inputs over text-only inputs, whereas open-source alternatives show significant performance degradation. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems. Our benchmark and code are available at https://mmdocrag.github.io/MMDocRAG/.

MMRAG-DocQA: A Multi-Modal Retrieval-Augmented Generation Method for Document Question-Answering with Hierarchical Index and Multi-Granularity Retrieval

Multimedia

Answers questions from long, mixed-up documents.

1 Aug 2025 1

94%

Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts

Artificial Intelligence

Helps computers understand pictures and text better.

24 Feb 2025 3

Country of Origin

🇨🇳 China

Page Count

47 pages
