RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding
By: Xichen Tan, Yunfan Ye, Yuanjing Luo, and more
Potential Business Impact:
Tests AI video understanding more accurately by picking the key moments relevant to each question.
Multi-modal Large Language Models (MLLMs) capable of video understanding are advancing rapidly. To assess their video comprehension capabilities, long video understanding benchmarks such as Video-MME and MLVU have been proposed. However, these benchmarks test models using uniform frame sampling, which causes significant information loss and prevents the evaluations from reflecting the true abilities of MLLMs. To address this, we propose RAG-Adapter, a plug-and-play framework that reduces information loss during testing by sampling the frames most relevant to the given question. Additionally, we introduce a Grouped-supervised Contrastive Learning (GCL) method to further enhance the sampling effectiveness of RAG-Adapter through fine-tuning on our constructed MMAT dataset. Finally, we test numerous baseline MLLMs on various video understanding benchmarks, finding that RAG-Adapter sampling consistently outperforms uniform sampling (e.g., the accuracy of GPT-4o on Video-MME increases by 9.3 percent), providing a more accurate testing method for long video benchmarks.
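The abstract describes the retrieval step only at a high level. As a rough sketch of the idea, the snippet below scores every candidate frame against the question using CLIP embeddings and keeps the top-k matches instead of sampling uniformly. The choice of CLIP as the retriever, the helper name sample_relevant_frames, and the cosine top-k criterion are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of question-aware frame sampling in the spirit of RAG-Adapter.
# Assumptions (not from the paper): CLIP as the retriever, cosine-similarity
# top-k selection, and frames supplied as a list of PIL images.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def sample_relevant_frames(frames: list[Image.Image], question: str, k: int = 16) -> list[int]:
    """Return indices of the k frames most similar to the question,
    sorted back into temporal order (a stand-in for the retrieval step)."""
    with torch.no_grad():
        text_inputs = processor(text=[question], return_tensors="pt", padding=True)
        text_emb = model.get_text_features(**text_inputs)
        image_inputs = processor(images=frames, return_tensors="pt")
        frame_embs = model.get_image_features(**image_inputs)
    # Cosine similarity between the question and every candidate frame.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    frame_embs = frame_embs / frame_embs.norm(dim=-1, keepdim=True)
    sims = (frame_embs @ text_emb.T).squeeze(-1)
    top_k = torch.topk(sims, k=min(k, len(frames))).indices
    return sorted(top_k.tolist())  # keep temporal order for the MLLM
```

The selected frames, kept in temporal order, would then replace the uniformly sampled frames fed to the MLLM under test.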
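The abstract likewise does not spell out the GCL objective. One plausible reading, sketched below, is a supervised contrastive loss (in the style of Khosla et al., 2020) in which embeddings sharing a group label act as positives for one another; the function grouped_contrastive_loss and all of its details are assumptions for illustration, and the paper's exact GCL formulation may differ.

```python
# A plausible instantiation of a grouped-supervised contrastive loss,
# NOT the paper's verified GCL: members of the same group are pulled together.
import torch
import torch.nn.functional as F

def grouped_contrastive_loss(embeddings: torch.Tensor,
                             group_ids: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """embeddings: (N, D) features; group_ids: (N,) integer group labels.
    Samples with the same group id are treated as positives for one another."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.T / temperature                      # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude self-pairs
    # Positive pairs: same group id, excluding self.
    pos_mask = (group_ids.unsqueeze(0) == group_ids.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average the log-probability over each anchor's positives.
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    # Only anchors that actually have positives contribute to the loss.
    has_pos = pos_mask.any(dim=1)
    return loss[has_pos].mean()
```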
Similar Papers
AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding
CV and Pattern Recognition
Helps computers understand long videos better.
E-VRAG: Enhancing Long Video Understanding with Resource-Efficient Retrieval Augmented Generation
CV and Pattern Recognition
Makes computers understand long videos faster and better.
M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
Computation and Language
Helps computers answer questions about pictures in many languages.