StoryMem: Multi-shot Long Video Storytelling with Memory
By: Kaiwen Zhang, Liming Jiang, Angtian Wang, and more
Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.
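The iterative Memory-to-Video (M2V) loop described above can be pictured with the sketch below. It is purely illustrative pseudocode inferred from the abstract: MemoryBank, generate_shot, the scoring, the tensor shapes, and the way positions are passed are all hypothetical placeholders, not the authors' released implementation.

```python
# Hypothetical sketch of the Memory-to-Video (M2V) loop described in the abstract.
# All class and function names are illustrative stand-ins, not the authors' code.
import torch


class MemoryBank:
    """Compact bank of keyframe latents taken from previously generated shots."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.frames: list[torch.Tensor] = []  # each entry: (C, H, W) latent keyframe

    def update(self, shot_latents: torch.Tensor, scores: torch.Tensor, k: int = 2):
        # Keyframe selection: keep the k highest-scoring frames of the new shot
        # (scores stand in for the paper's semantic selection + aesthetic filtering).
        top = scores.topk(min(k, scores.numel())).indices
        for i in top.tolist():
            self.frames.append(shot_latents[i])
        self.frames = self.frames[-self.capacity:]  # keep the bank compact

    def as_condition(self):
        if not self.frames:
            return None, None
        mem = torch.stack(self.frames)                  # (M, C, H, W)
        # Negative position shift: memory frames sit at positions before the new
        # shot, mimicking the negative RoPE shift that marks them as context.
        positions = -torch.arange(mem.shape[0], 0, -1)  # e.g. [-M, ..., -1]
        return mem, positions


def generate_shot(prompt_embed, memory, mem_positions,
                  num_frames: int = 16, latent_shape=(4, 32, 32)):
    """Placeholder for a single-shot video diffusion sampler conditioned on memory
    via latent concatenation (random latents here instead of real denoising)."""
    shot = torch.randn(num_frames, *latent_shape)
    if memory is not None:
        # Latent concatenation along the frame axis; a real model would also apply
        # the shifted rotary positions to the memory tokens inside attention.
        _ = torch.cat([memory, shot], dim=0), mem_positions
    return shot


# Iterative multi-shot storytelling: each new shot is conditioned on the memory
# bank, then the bank is refreshed with keyframes from the shot just generated.
bank = MemoryBank(capacity=8)
shot_prompts = [torch.randn(77, 768) for _ in range(3)]  # stand-in text embeddings
story = []
for prompt in shot_prompts:
    mem, pos = bank.as_condition()
    shot = generate_shot(prompt, mem, pos)
    story.append(shot)
    bank.update(shot, scores=torch.rand(shot.shape[0]))  # stand-in frame scores
```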
Similar Papers
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
CV and Pattern Recognition
Generates coherent multi-shot video stories with an adaptive memory mechanism.
VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management
CV and Pattern Recognition
Improves ultra-long video understanding through adaptive memory management.
STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative
CV and Pattern Recognition
Generates cinematic multi-shot narratives anchored to storyboards, keeping characters consistent.