USV: Unified Sparsification for Accelerating Video Diffusion Models
By: Xinjian Wu, Hongmei Wang, Yuan Zhou and more
The scalability of high-fidelity video diffusion models (VDMs) is constrained by two key sources of redundancy: the quadratic complexity of global spatio-temporal attention and the computational overhead of long iterative denoising trajectories. Existing accelerators -- such as sparse attention and step-distilled samplers -- typically target a single dimension in isolation and quickly encounter diminishing returns, as the remaining bottlenecks become dominant. In this work, we introduce USV (Unified Sparsification for Video diffusion models), an end-to-end trainable framework that overcomes this limitation by jointly orchestrating sparsification across both the model's internal computation and its sampling process. USV learns a dynamic, data- and timestep-dependent sparsification policy that prunes redundant attention connections, adaptively merges semantically similar tokens, and reduces denoising steps, treating them not as independent tricks but as coordinated actions within a single optimization objective. This multi-dimensional co-design enables strong mutual reinforcement among previously disjoint acceleration strategies. Extensive experiments on large-scale video generation benchmarks demonstrate that USV achieves up to 83.3% speedup in the denoising process and 22.7% end-to-end acceleration, while maintaining high visual fidelity. Our results highlight unified, dynamic sparsification as a practical path toward efficient, high-quality video generation.
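To make the argument for joint sparsification concrete, here is a toy cost model, not the authors' method or measurements. It assumes attention cost grows with the square of the token count and linearly with the number of denoising steps, as the abstract describes. The function names and the `merge_ratio` and `keep_ratio` knobs are hypothetical and chosen only for illustration.

```python
def attention_cost(num_tokens, num_steps, merge_ratio=0.0, keep_ratio=1.0):
    """Toy relative cost of global attention over a full sampling run.

    num_tokens  : spatio-temporal tokens before merging
    num_steps   : number of denoising steps
    merge_ratio : fraction of tokens removed by merging (hypothetical knob)
    keep_ratio  : fraction of attention pairs kept after pruning (hypothetical knob)
    """
    if not (0.0 <= merge_ratio < 1.0):
        raise ValueError("merge_ratio must be in [0, 1)")
    if not (0.0 < keep_ratio <= 1.0):
        raise ValueError("keep_ratio must be in (0, 1]")
    remaining = num_tokens * (1.0 - merge_ratio)
    # Quadratic in tokens, linear in kept pairs and in steps.
    return num_steps * remaining ** 2 * keep_ratio


def relative_cost(num_tokens, base_steps, steps, merge_ratio, keep_ratio):
    """Cost of a sparsified run divided by the dense, full-step baseline."""
    baseline = attention_cost(num_tokens, base_steps)
    return attention_cost(num_tokens, steps, merge_ratio, keep_ratio) / baseline


if __name__ == "__main__":
    n, base = 1000, 50
    print(relative_cost(n, base, base, 0.0, 0.5))  # pruning only: 0.5
    print(relative_cost(n, base, base, 0.5, 1.0))  # merging only: 0.25
    print(relative_cost(n, base, 25, 0.0, 1.0))    # fewer steps only: 0.5
    print(relative_cost(n, base, 25, 0.5, 0.5))    # all three: 0.0625
```

In this simplified model each technique on its own leaves at least a quarter of the attention cost in place, while the three combined multiply down to about 6%. Real speedups depend on the non-attention layers, kernel efficiency, and the quality budget, none of which this sketch captures.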