MV-RAG: Retrieval Augmented Multiview Diffusion

Published: August 22, 2025 | arXiv ID: 2508.16577v1

By: Yosef Dayani, Omer Benishu, Sagie Benaim

Potential Business Impact:

Makes 3D objects from rare ideas.

Business Areas:

Augmented Reality Hardware, Software

Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.
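The held-out view prediction objective described above can be illustrated with a minimal data-preparation sketch. It is not the authors' implementation: the function name `held_out_splits`, the placeholder image names, and the choice to hold out every image in turn (leave-one-out) are assumptions made for illustration. The abstract states only that the model predicts a held-out view from the other views.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def held_out_splits(views: Sequence[T]) -> Iterator[Tuple[List[T], T]]:
    """Yield (conditioning_views, held_out_view) pairs, leaving one view out at a time.

    Hypothetical helper: the leave-one-out schedule is an assumption,
    not something the abstract specifies.
    """
    if len(views) < 2:
        raise ValueError("need at least two views: one to hold out, one to condition on")
    for i, target in enumerate(views):
        # Every view except the i-th becomes conditioning input.
        conditioning = [v for j, v in enumerate(views) if j != i]
        yield conditioning, target


# Placeholder names standing in for a set of retrieved 2D images of one concept.
retrieved = ["img_a", "img_b", "img_c"]
splits = list(held_out_splits(retrieved))
```

In a training loop, each pair would supply the conditioning images and the target that the diffusion model is asked to predict. How the real pipeline samples the held-out image is not stated in the abstract.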

VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

Computation and Language

Helps computers understand pictures and text in documents.

14 Apr 2025

90%

Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation

CV and Pattern Recognition

Lets you design clothes by just describing them.

18 Apr 2025

90%

VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

CV and Pattern Recognition

Makes computer characters move more realistically from videos.

16 Aug 2025

Page Count

23 pages
