Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG

Published: March 3, 2025 | arXiv ID: 2503.01222v2

By: Wenbin Wang, Yongcheng Jing, Liang Ding and more

Potential Business Impact:

Lets computers see tiny details in pictures.

Business Areas:

Augmented Reality Hardware, Software

High-resolution (HR) image perception remains a key challenge for multimodal large language models (MLLMs). Rather than relying on the dedicated heuristics of prior methods, this paper revisits the most fundamental idea behind HR perception: enhancing the long-context capability of MLLMs. The approach is motivated by recent advances in long-context techniques for general LLMs, such as retrieval-augmented generation (RAG). To this end, the paper presents the first study exploring the use of RAG to address HR perception challenges. Specifically, we propose Retrieval-Augmented Perception (RAP), a training-free framework that retrieves and fuses relevant image crops while preserving spatial context through the proposed Spatial-Awareness Layout. To accommodate different tasks, the proposed Retrieved-Exploration Search (RE-Search) dynamically selects the optimal number of crops based on model confidence and retrieval scores. Experimental results on HR benchmarks show that RAP is highly effective, with LLaVA-v1.5-13B achieving a 43% improvement on $V^*$ Bench and 19% on HR-Bench.
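
As a rough illustration of the retrieve-then-arrange idea described in the abstract above, the sketch below ranks image crops by similarity to a query and then lays out the selected crops so that their relative positions are preserved. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`cosine`, `retrieve_crops`, `compact_layout`), the use of cosine similarity, the precomputed embeddings keyed by grid position, and the row/column compaction rule are all hypothetical choices made for this example. RE-Search, which chooses the number of crops, is not modeled here; `k` is simply passed in.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_crops(crop_embeddings, query_embedding, k):
    """Return the grid positions of the k crops most similar to the query.

    crop_embeddings maps (row, col) grid positions to embedding vectors.
    Assumption: embeddings come from some external encoder not shown here.
    """
    ranked = sorted(
        crop_embeddings,
        key=lambda pos: cosine(crop_embeddings[pos], query_embedding),
        reverse=True,
    )
    return ranked[:k]


def compact_layout(selected):
    """Arrange selected crops on a smaller grid, keeping their relative order.

    Rows and columns that contain no selected crop are dropped, so a crop
    that was above or left of another stays above or left of it.
    Empty cells in the compacted grid are None.
    """
    rows = sorted({r for r, _ in selected})
    cols = sorted({c for _, c in selected})
    row_index = {r: i for i, r in enumerate(rows)}
    col_index = {c: j for j, c in enumerate(cols)}
    grid = [[None] * len(cols) for _ in rows]
    for r, c in selected:
        grid[row_index[r]][col_index[c]] = (r, c)
    return grid


# Toy usage: four crops from a 3x4 grid, with 2-dimensional embeddings.
embeddings = {
    (0, 0): [1.0, 0.0],
    (0, 3): [0.0, 1.0],
    (2, 0): [0.0, 1.0],
    (2, 3): [0.9, 0.1],
}
top = retrieve_crops(embeddings, [1.0, 0.0], k=2)
layout = compact_layout(top)
```

In this toy case the two crops closest to the query sit at opposite corners of the original grid, and the compacted layout keeps them on a diagonal rather than placing them side by side, which is the kind of spatial relationship a layout step would aim to retain.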

Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook

CV and Pattern Recognition

Helps computers "see" and create pictures better.

23 Mar 2025 1

91%

Optimizing Retrieval for RAG via Reinforced Contrastive Learning

Computation and Language

AI learns to find better information for itself.

28 Oct 2025 1

91%

Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding

Computation and Language

Helps computers understand all parts of documents.

17 Oct 2025 0

Repos / Data Links

github.com

Page Count

16 pages
