Pretraining Frame Preservation in Autoregressive Video Memory Compression
By: Lvmin Zhang, Shengqu Cai, Muyang Li and more
We present PFP, a neural network structure that compresses long videos into short contexts, with an explicit pretraining objective of preserving the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context of length about 5k, from which random frames can be retrieved with perceptually preserved appearance. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.
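To give a feel for the context budget mentioned above, the following is a minimal arithmetic sketch, not the authors' method. Only the 20-second duration and the context length of about 5k come from the abstract; the frame rate, the per-frame token count, and the helper name `compression_ratio` are hypothetical values chosen purely for illustration.

```python
def compression_ratio(duration_s, fps, tokens_per_frame, context_len):
    """Return (uncompressed token count) / (compressed context length).

    Illustrative helper only; it is not part of PFP.
    """
    if min(duration_s, fps, tokens_per_frame, context_len) <= 0:
        raise ValueError("all arguments must be positive")
    raw_tokens = duration_s * fps * tokens_per_frame
    return raw_tokens / context_len


# From the abstract: a 20-second video and a context of about 5k.
DURATION_S = 20
CONTEXT_LEN = 5_000

# ASSUMPTIONS (not stated in the abstract): frame rate and tokens per frame.
ASSUMED_FPS = 16
ASSUMED_TOKENS_PER_FRAME = 256

ratio = compression_ratio(
    DURATION_S, ASSUMED_FPS, ASSUMED_TOKENS_PER_FRAME, CONTEXT_LEN
)
print(f"raw tokens are about {ratio:.1f}x the compressed context")
```

Under different assumed frame rates or tokenizations the ratio changes proportionally, so the number is only meaningful relative to the stated assumptions.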