VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models
By: Haidong Xu, Guangwei Xu, Zhedong Zheng and more
Potential Business Impact:
Makes computer characters move more realistically from videos.
This paper introduces VimoRAG, a novel video-based retrieval-augmented motion generation framework for motion large language models (LLMs). Because motion LLMs face severe out-of-domain and out-of-vocabulary issues due to limited annotated data, VimoRAG leverages large-scale in-the-wild video databases to enhance 3D motion generation by retrieving relevant 2D human motion signals. Video-based motion RAG is nontrivial, and we address two key bottlenecks: (1) developing an effective motion-centered video retrieval model that distinguishes human poses and actions, and (2) mitigating the error propagation caused by suboptimal retrieval results. We design the Gemini Motion Video Retriever mechanism and the Motion-centric Dual-alignment DPO Trainer, enabling effective retrieval and generation processes. Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input.
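The retrieval step described in the abstract can be illustrated with a generic embedding lookup. The sketch below is not the paper's Gemini Motion Video Retriever; it only assumes that the text query and each video have already been embedded into a shared vector space, and it ranks videos by cosine similarity. The names `cosine`, `retrieve_top_k`, and `video_index`, along with the toy vectors, are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, video_index, k=1):
    """Return the ids of the k videos whose embeddings are most similar to the query."""
    scored = [(cosine(query_vec, vec), vid) for vid, vec in video_index.items()]
    # Sort by similarity only, highest first; ties keep insertion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [vid for _, vid in scored[:k]]

# Toy index: hypothetical video ids mapped to made-up 3-d embeddings.
video_index = {
    "clip_walk": [1.0, 0.0, 0.0],
    "clip_jump": [0.0, 1.0, 0.0],
    "clip_wave": [0.0, 0.0, 1.0],
}

# Hypothetical embedding of a text prompt such as "a person jumps".
query = [0.1, 0.9, 0.0]
top = retrieve_top_k(query, video_index, k=2)
print(top)
```

In a retrieval-augmented setup of this kind, the retrieved clips would then be passed to the generator as extra conditioning alongside the text prompt. How VimoRAG encodes the videos and limits the effect of poor retrievals is specific to the paper and is not reproduced here.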
Similar Papers
RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism
CV and Pattern Recognition
Makes computer-made videos move more realistically.
E-VRAG: Enhancing Long Video Understanding with Resource-Efficient Retrieval Augmented Generation
CV and Pattern Recognition
Makes computers understand long videos faster and better.
MV-RAG: Retrieval Augmented Multiview Diffusion
CV and Pattern Recognition
Makes 3D objects from rare ideas.