Roles of MLLMs in Visually Rich Document Retrieval for RAG: A Survey
By: Xiantao Zhang
Potential Business Impact:
Helps computers find and use information from documents that mix text, pictures, and tables.
Visually rich documents (VRDs) challenge retrieval-augmented generation (RAG) with layout-dependent semantics, brittle OCR, and evidence spread across complex figures and structured tables. This survey examines how Multimodal Large Language Models (MLLMs) are being used to make VRD retrieval practical for RAG. We organize the literature around three roles that MLLMs play in the retrieval pipeline: Modality-Unifying Captioners, Multimodal Embedders, and End-to-End Representers. We compare these roles along retrieval granularity, information fidelity, latency and index size, and compatibility with reranking and grounding. We also outline the key trade-offs and offer practical guidance on when to favor each role. Finally, we identify promising directions for future research, including adaptive retrieval units, model-size reduction, and improved evaluation methodology.
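To make the three roles concrete, the minimal sketch below (illustrative only, not taken from the survey) shows how each role changes what gets stored in the retrieval index. All model calls (caption_page, embed_multimodal, encode_page_end_to_end) are hypothetical placeholders standing in for whatever MLLM or embedding model a real system would use.

from dataclasses import dataclass

@dataclass
class PageIndexEntry:
    page_id: str
    payload: object  # text for captioners, vector(s) for embedders / end-to-end representers

def caption_page(page_image: bytes) -> str:
    # Role 1: Modality-Unifying Captioner. An MLLM verbalizes the page
    # (figures, tables, layout) into text, which a standard text retriever indexes.
    return "placeholder caption describing the page's figures, tables, and layout"

def embed_multimodal(page_image: bytes, page_text: str) -> list:
    # Role 2: Multimodal Embedder. Image and text are fused into one dense
    # vector per page; retrieval is nearest-neighbor search over these vectors.
    return [0.0] * 768  # placeholder single-vector embedding

def encode_page_end_to_end(page_image: bytes) -> list:
    # Role 3: End-to-End Representer. The raw page image is encoded directly,
    # e.g. into patch-level multi-vector representations, with no OCR step.
    return [[0.0] * 128 for _ in range(16)]  # placeholder multi-vector output

def build_index(pages: dict, role: str) -> list:
    # pages maps page_id -> (page_image, ocr_text)
    entries = []
    for page_id, (image, text) in pages.items():
        if role == "captioner":
            payload = caption_page(image)            # small text index, lossy fidelity
        elif role == "embedder":
            payload = embed_multimodal(image, text)  # one vector per page
        else:
            payload = encode_page_end_to_end(image)  # higher fidelity, larger index
        entries.append(PageIndexEntry(page_id, payload))
    return entries

if __name__ == "__main__":
    pages = {"report_p3": (b"<page image bytes>", "OCR text of page 3")}
    for role in ("captioner", "embedder", "end_to_end"):
        entry = build_index(pages, role)[0]
        print(role, "->", type(entry.payload).__name__)

The comments mark where the abstract's trade-offs show up: captions keep the index small and text-retriever-compatible but are lossy, single-vector embeddings sit in between, and end-to-end page representations preserve more visual evidence at the cost of larger indexes and heavier retrieval.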
Similar Papers
A Multi-Granularity Retrieval Framework for Visually-Rich Documents
Information Retrieval
Finds information in visually rich documents at multiple levels of detail.
A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends
CV and Pattern Recognition
Reviews how multimodal language models read documents that mix pictures and words.
Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems
Computation and Language
Compares searching documents by their text versus by their page images.