Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
By: Xiangrui Liu, Yan Shu, Zheng Liu, and more
Potential Business Impact:
Lets computers watch and understand very long videos.
Despite advanced token compression techniques, existing multimodal large language models (MLLMs) still struggle with hour-long video understanding. In this work, we propose Video-XL-Pro, an efficient method for extremely long video understanding, built upon Reconstructive Compression of Tokens (ReCoT), a learnable module that leverages self-supervised learning to generate comprehensive and compact video tokens. ReCoT introduces two key components: (i) Dynamic Token Synthesizer (DTS): DTS generates pseudo-video tokens from static image tokens by learning intra-token relationships, which are then used in masked video modeling. (ii) Semantic-Guided Masking (SGM): SGM adaptively masks redundant visual tokens to facilitate more effective reconstructive learning. To improve training efficiency in MLLM fine-tuning, we introduce a video-specific dataset pruning strategy and design a simple yet effective Query-aware Selector that enables the model to precisely locate query-relevant video tokens. With only 3B parameters, Video-XL-Pro outperforms most 7B models trained on larger datasets across multiple long video understanding benchmarks. Moreover, it can process over 8K frames on a single A100 GPU while maintaining high-quality performance.
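To make the Query-aware Selector idea concrete, below is a minimal sketch (not the authors' code) of one plausible realization: score each compressed video token against a pooled text-query embedding and keep only the top-k most relevant tokens before they reach the LLM. The class name, dimensions, pooling choice, and keep ratio are all illustrative assumptions.

```python
# Hypothetical sketch of a query-aware token selector; names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryAwareSelector(nn.Module):
    def __init__(self, vis_dim: int = 1024, txt_dim: int = 1024, keep_ratio: float = 0.25):
        super().__init__()
        self.query_proj = nn.Linear(txt_dim, vis_dim)  # map the query into the visual token space
        self.keep_ratio = keep_ratio                   # fraction of video tokens to retain

    def forward(self, video_tokens: torch.Tensor, query_tokens: torch.Tensor) -> torch.Tensor:
        """
        video_tokens: (B, N, vis_dim)  compressed video tokens (e.g., after ReCoT)
        query_tokens: (B, T, txt_dim)  text-query token embeddings
        returns:      (B, K, vis_dim)  the K most query-relevant video tokens
        """
        q = self.query_proj(query_tokens.mean(dim=1))                         # (B, vis_dim) pooled query
        scores = F.cosine_similarity(video_tokens, q.unsqueeze(1), dim=-1)    # (B, N) relevance scores
        k = max(1, int(video_tokens.size(1) * self.keep_ratio))
        top_idx = scores.topk(k, dim=1).indices                               # indices of top-k tokens
        top_idx, _ = top_idx.sort(dim=1)                                      # preserve temporal order
        batch_idx = torch.arange(video_tokens.size(0)).unsqueeze(-1)
        return video_tokens[batch_idx, top_idx]                               # (B, K, vis_dim)


# Toy usage: 2 clips with 512 video tokens each, 16-token text queries
selector = QueryAwareSelector()
selected = selector(torch.randn(2, 512, 1024), torch.randn(2, 16, 1024))     # -> (2, 128, 1024)
```

Keeping the selected tokens in temporal order (the `sort` step) matters because the downstream LLM still relies on frame ordering to reason about events in long videos.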
Similar Papers
MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
CV and Pattern Recognition
Makes computers understand videos using less data.
Accelerating Streaming Video Large Language Models via Hierarchical Token Compression
CV and Pattern Recognition
Makes AI understand streaming video faster with less data.
Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces
CV and Pattern Recognition
Makes videos smaller for faster AI creation.