APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval
By: Hong Gao, Yiming Bao, Xuezhen Tu and more
Potential Business Impact:
Lets computers watch and understand long videos.
Current multimodal large language models (MLLMs) struggle with hour-level video understanding. They face significant challenges not only in modeling the substantial information volume of long videos but also in overcoming the memory wall and resource constraints during both training and inference. Although recent training-free approaches have alleviated resource demands by compressing visual features, their reliance on incomplete visual information limits their performance potential. To address these limitations, we propose Adaptive Pivot Visual information Retrieval (APVR), a training-free framework that hierarchically retrieves and retains sufficient and important visual information. It breaks through the memory wall via two complementary components: Pivot Frame Retrieval employs query expansion and iterative spatio-semantic confidence scoring to identify relevant video frames, and Pivot Token Retrieval performs query-aware, attention-driven token selection within up to 1024 pivot frames. This dual-granularity approach enables the processing of hour-long videos while maintaining semantic fidelity. Experimental validation demonstrates significant performance improvements, achieving 64.9% on LongVideoBench and 68.4% on VideoMME, which are state-of-the-art results for both training-free and training-based approaches. In addition, our method offers plug-and-play integration with existing MLLM architectures.
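The two-level retrieval described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the per-frame confidence scores and per-token attention weights have already been computed (the query expansion and iterative spatio-semantic scoring are not reproduced here), and the function names, the `keep_ratio` parameter, and the plain top-k selection rule are all hypothetical choices made for illustration. Only the 1024-frame cap comes from the abstract.

```python
from typing import Dict, List, Sequence


def select_pivot_frames(frame_scores: Sequence[float], max_frames: int = 1024) -> List[int]:
    """Keep the indices of the highest-scoring frames, returned in temporal order.

    frame_scores is assumed to hold one precomputed relevance score per frame.
    """
    ranked = sorted(range(len(frame_scores)), key=lambda i: frame_scores[i], reverse=True)
    return sorted(ranked[:max_frames])


def select_pivot_tokens(token_attention: Sequence[float], keep_ratio: float = 0.5) -> List[int]:
    """Keep the most-attended tokens of one frame, returned in original order.

    keep_ratio is a hypothetical knob; at least one token is always kept.
    """
    keep = max(1, int(len(token_attention) * keep_ratio))
    ranked = sorted(range(len(token_attention)), key=lambda i: token_attention[i], reverse=True)
    return sorted(ranked[:keep])


def retrieve(
    frame_scores: Sequence[float],
    token_attention_per_frame: Sequence[Sequence[float]],
    max_frames: int = 1024,
    keep_ratio: float = 0.5,
) -> Dict[int, List[int]]:
    """Coarse-to-fine selection: pick frames first, then tokens inside each kept frame."""
    frames = select_pivot_frames(frame_scores, max_frames)
    return {f: select_pivot_tokens(token_attention_per_frame[f], keep_ratio) for f in frames}


if __name__ == "__main__":
    # Toy input: 4 frames with made-up scores and made-up token attention weights.
    scores = [0.1, 0.9, 0.3, 0.8]
    attention = [[0.1, 0.2], [0.5, 0.1, 0.4, 0.0], [0.3, 0.3], [0.2, 0.7]]
    print(retrieve(scores, attention, max_frames=2, keep_ratio=0.5))
```

Sorting the kept indices back into their original order preserves the temporal order of frames and the spatial order of tokens, which is a design choice of this sketch rather than something the abstract specifies.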
Similar Papers
Point to Span: Zero-Shot Moment Retrieval for Navigating Unseen Hour-Long Videos
CV and Pattern Recognition
Find moments in long videos with just words.
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
CV and Pattern Recognition
Helps computers watch long videos faster.
E-VRAG: Enhancing Long Video Understanding with Resource-Efficient Retrieval Augmented Generation
CV and Pattern Recognition
Makes computers understand long videos faster and better.