BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation
By: Zeyu Zhang, Shuning Chang, Yuanyu He and more
Potential Business Impact:
Makes AI create long, clear videos that make sense.
Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two persistent challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with a semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics for evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over state-of-the-art approaches. Project website: https://ziplab.co/BlockVid. Inferix (Code): https://github.com/alibaba-damo-academy/Inferix.
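The semi-autoregressive structure described in the abstract can be illustrated with a toy sketch. This is not BlockVid's implementation: frames are reduced to single floats, denoise_step is a hypothetical stand-in for a learned denoiser, and the cache is a plain sliding window rather than the semantic-aware sparse KV cache the paper proposes. It only shows the control flow: blocks are produced one after another, all frames inside a block are refined together, and each finished block is cached as context for the next.

```python
import random


def denoise_step(block, context):
    """Hypothetical toy denoiser: nudge every frame toward the context mean.

    A real model would attend over cached keys/values; here the "context"
    is just a list of previously generated scalar frames.
    """
    target = sum(context) / len(context) if context else 0.0
    # All frames in the block are updated together (parallel within a block).
    return [x + 0.5 * (target - x) for x in block]


def generate(num_blocks, block_size, num_steps=4, cache_blocks=2, seed=0):
    """Generate num_blocks * block_size toy frames, one block at a time."""
    rng = random.Random(seed)
    cache = []  # stand-in for a KV cache built from earlier blocks
    video = []
    keep = cache_blocks * block_size  # assumed cache budget, in frames
    for _ in range(num_blocks):
        # Each block starts from pure noise.
        block = [rng.gauss(0.0, 1.0) for _ in range(block_size)]
        # Iterative refinement within the block, conditioned on the cache.
        for _ in range(num_steps):
            block = denoise_step(block, cache)
        video.extend(block)
        # Autoregressive across blocks: the finished block becomes context.
        cache.extend(block)
        cache = cache[-keep:] if keep > 0 else []
    return video


video = generate(num_blocks=3, block_size=4)
print(len(video))
```

Because every block is conditioned on cached output from earlier blocks, any error in those blocks is carried forward, which is the long-horizon accumulation issue the abstract attributes to KV caching.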
Similar Papers
Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression
CV and Pattern Recognition
Makes videos play longer without looking weird.
VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory
CV and Pattern Recognition
Creates longer, smoother, and more varied videos.
MiVID: Multi-Strategic Self-Supervision for Video Frame Interpolation using Diffusion Model
CV and Pattern Recognition
Makes videos smoother by guessing missing frames.