InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
By: Haotian Ye, Qiyuan He, Jiaqi Han, and more
Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet the inherent complexity and variable information density of videos create a bottleneck for current tokenizers, which compress all content at a fixed rate, causing either redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and we present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Building on this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression: InfoTok saves 20% of tokens with no loss in downstream performance, and achieves a 2.3x compression rate while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok yields a more compact yet accurate video representation, offering valuable insights for future research.
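The core idea of allocating tokens according to informational richness can be illustrated with a minimal sketch. The snippet below is not the paper's ELBO-based method; it is a simplified stand-in that estimates each frame's information content via the Shannon entropy of its intensity histogram and splits a fixed token budget across frames in proportion to that estimate. All names (`frame_entropy`, `allocate_tokens`) and the histogram-entropy proxy are illustrative assumptions.

```python
import numpy as np

def frame_entropy(frame, bins=32):
    """Estimate a frame's information content as the Shannon entropy
    (in bits) of its intensity histogram. A crude proxy, not the
    paper's learned information measure."""
    hist, _ = np.histogram(frame, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def allocate_tokens(frames, total_budget):
    """Split a total token budget across frames in proportion to
    their estimated entropy: information-rich frames get more tokens,
    redundant frames get fewer."""
    ents = np.array([frame_entropy(f) for f in frames])
    weights = ents / ents.sum()
    alloc = np.floor(weights * total_budget).astype(int)
    # Hand any rounding leftover to the highest-entropy frames.
    leftover = total_budget - alloc.sum()
    for i in np.argsort(-ents)[:leftover]:
        alloc[i] += 1
    return alloc

rng = np.random.default_rng(0)
flat = np.full((16, 16), 0.5)   # constant frame: near-zero information
noisy = rng.random((16, 16))    # noise frame: high information
alloc = allocate_tokens([flat, noisy, flat], total_budget=96)
print(alloc, alloc.sum())       # noisy frame receives the bulk of the budget
```

A fixed-rate tokenizer would instead assign 32 tokens to every frame here, wasting most of them on the two constant frames; the adaptive allocation concentrates the budget where the content is.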