A Multi-Granularity Retrieval Framework for Visually-Rich Documents

Published: May 1, 2025 | arXiv ID: 2505.01457v2

By: Mingjun Xu, Zehui Wang, Hengxing Cai, and more

Potential Business Impact:

Helps computers understand pictures and words together.

Business Areas:
Visual Search, Internet Services

Retrieval-augmented generation (RAG) systems have predominantly focused on text-based retrieval, limiting their effectiveness in handling visually-rich documents that encompass text, images, tables, and charts. To bridge this gap, we propose a unified multi-granularity multimodal retrieval framework tailored for two benchmark tasks: MMDocIR and M2KR. Our approach integrates hierarchical encoding strategies, modality-aware retrieval mechanisms, and vision-language model (VLM)-based candidate filtering to effectively capture and utilize the complex interdependencies between textual and visual modalities. By leveraging off-the-shelf vision-language models and implementing a training-free hybrid retrieval strategy, our framework demonstrates robust performance without the need for task-specific fine-tuning. Experimental evaluations reveal that incorporating layout-aware search and VLM-based candidate verification significantly enhances retrieval accuracy, achieving a top performance score of 65.56. This work underscores the potential of scalable and reproducible solutions in advancing multimodal document retrieval systems.
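The training-free hybrid strategy described above can be sketched as a two-stage pipeline: fuse text-channel and visual-channel similarity scores to shortlist candidates, then filter the shortlist with a VLM-based verifier. The minimal sketch below is illustrative only; the embeddings, the fusion weight `alpha`, and the `vlm_filter` stub are hypothetical placeholders, not the authors' actual components.

```python
# Hedged sketch of training-free hybrid multimodal retrieval.
# All names (hybrid_retrieve, vlm_filter, alpha) are illustrative
# assumptions, not taken from the paper.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_retrieve(query_text_emb, query_img_emb, candidates, alpha=0.5, top_k=2):
    """Fuse text- and image-channel similarities; return top_k candidate ids.

    candidates: list of dicts with 'id', 'text_emb', 'img_emb'.
    alpha weights the text channel; (1 - alpha) weights the visual channel.
    """
    scored = []
    for c in candidates:
        s_text = cosine(query_text_emb, c["text_emb"])
        s_img = cosine(query_img_emb, c["img_emb"])
        scored.append((alpha * s_text + (1 - alpha) * s_img, c["id"]))
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:top_k]]

def vlm_filter(candidate_ids, accept):
    """Stand-in for VLM-based candidate verification: keep accepted ids."""
    return [cid for cid in candidate_ids if accept(cid)]

# Toy corpus with two-dimensional embeddings for clarity.
corpus = [
    {"id": "page-1", "text_emb": [1.0, 0.0], "img_emb": [0.0, 1.0]},
    {"id": "page-2", "text_emb": [0.0, 1.0], "img_emb": [1.0, 0.0]},
    {"id": "page-3", "text_emb": [0.7, 0.7], "img_emb": [0.7, 0.7]},
]

shortlist = hybrid_retrieve([1.0, 0.0], [0.0, 1.0], corpus, alpha=0.5, top_k=2)
# The verifier here is a dummy predicate standing in for a VLM judgment.
final = vlm_filter(shortlist, accept=lambda cid: cid != "page-3")
```

In a real system the shortlist stage would query dense indexes built from off-the-shelf text and vision-language encoders, and the filter stage would prompt a VLM to judge query-candidate relevance; the two-stage shape is what makes the approach cheap to run without task-specific fine-tuning.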

Page Count
3 pages

Category
Computer Science:
Information Retrieval