Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation
By: Tianrui Zhu, Shiyi Zhang, Zhirui Sun, and more
Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation with quality comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches to long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining the full history incurs prohibitive memory costs. To address this trade-off, we propose Memorize-and-Generate (MAG), a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames using this compressed representation. Furthermore, we introduce MAG-Bench, a benchmark designed to rigorously evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.
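To make the decoupling concrete, below is a minimal PyTorch sketch of a memorize-then-generate loop in the spirit of the abstract: a memory module compresses the growing token history into a fixed-size KV cache, and a separate generator cross-attends to that cache rather than the full history. The module names (MemoryModel, GeneratorModel), the learned-query compression scheme, and all shapes are illustrative assumptions, not the paper's actual architecture or training procedure.

```python
# Hypothetical sketch of the MAG-style decoupling; not the authors' code.
import torch
import torch.nn as nn


class MemoryModel(nn.Module):
    """Assumed design: learned query slots attend over all past frame tokens,
    compressing the unbounded history into a constant-size KV cache."""

    def __init__(self, dim: int = 512, mem_slots: int = 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(mem_slots, dim))  # learned slots
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (B, T*L, dim) flattened tokens of all past frames
        q = self.queries.unsqueeze(0).expand(history.size(0), -1, -1)
        compressed, _ = self.attn(q, history, history)  # (B, mem_slots, dim)
        return compressed  # compact stand-in for the full history


class GeneratorModel(nn.Module):
    """Assumed frame-AR generator: predicts the next frame's tokens by
    cross-attending to the compressed memory instead of the full history."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, cur_frame: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        ctx, _ = self.cross_attn(cur_frame, memory, memory)
        return self.out(cur_frame + ctx)  # toy next-frame prediction


# Toy autoregressive rollout: compress the history, then generate one frame.
B, L, dim = 1, 16, 512
mem_model, gen_model = MemoryModel(dim=dim), GeneratorModel(dim=dim)
frames = [torch.randn(B, L, dim)]  # bootstrap frame tokens
for _ in range(4):
    history = torch.cat(frames, dim=1)   # (B, T*L, dim), grows with T
    memory = mem_model(history)          # (B, 64, dim), constant size
    frames.append(gen_model(frames[-1], memory))
```

Note the key property the abstract emphasizes: the generator's context cost is bounded by the cache size (here 64 slots) regardless of how many frames have been produced, which is what avoids both the forgetting of window attention and the memory blow-up of retaining full history.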
Similar Papers
VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management
CV and Pattern Recognition
Lets computers watch and remember long videos.
VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory
CV and Pattern Recognition
Creates longer, smoother, and more varied videos.
Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft
CV and Pattern Recognition
Makes game worlds remember past actions for better play.