PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference
By: Denis Korzhenkov, Adil Karjauv, Animesh Karnewar and more
Potential Business Impact:
Makes videos look real with less computer power.
Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at varying resolutions. These models handle inputs with higher noise levels at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform state-of-the-art systems in terms of visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, without degrading the quality of the output videos. Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further improve inference efficiency. Our results are available at https://qualcomm-ai-research.github.io/PyramidalWan.
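To make the idea of stages concrete, here is a minimal sketch of how a pyramidal schedule might assign denoising steps to resolutions, and how that could affect a rough cost count. It is an illustration only, not the authors' pipeline: the function names, the uniform split of the noise range, the factor-of-two resolution ladder, and the pixel-count cost proxy are all assumptions made for this example.

```python
def stage_for_noise(noise_level, num_stages):
    """Map a noise level in [0, 1] (1.0 = pure noise) to a stage index.

    Stage 0 is the coarsest. Splitting the noise range uniformly
    across stages is an assumption of this sketch.
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must lie in [0, 1]")
    stage = int((1.0 - noise_level) * num_stages)
    return min(stage, num_stages - 1)  # noise_level == 0.0 stays in the last stage


def stage_resolution(full_hw, stage, num_stages):
    """Spatial size for a stage, assuming each coarser stage halves both sides."""
    height, width = full_hw
    factor = 2 ** (num_stages - 1 - stage)
    return height // factor, width // factor


def relative_cost(full_hw, num_stages, steps):
    """Ratio of pixel-steps: pyramidal schedule versus full resolution throughout.

    Pixel count per step is a crude proxy for compute. It ignores
    attention's superlinear scaling and the temporal dimension.
    """
    height, width = full_hw
    full_cost = steps * height * width
    pyramidal_cost = 0
    for i in range(steps):
        # Use the midpoint of each step's noise interval, from high noise to low.
        noise_level = 1.0 - (i + 0.5) / steps
        stage = stage_for_noise(noise_level, num_stages)
        stage_h, stage_w = stage_resolution(full_hw, stage, num_stages)
        pyramidal_cost += stage_h * stage_w
    return pyramidal_cost / full_cost


if __name__ == "__main__":
    # Hypothetical setting: 480x832 frames, 3 stages, 30 denoising steps.
    print(relative_cost((480, 832), num_stages=3, steps=30))
```

Under these toy assumptions the noisiest third of the steps runs at one sixteenth of the full pixel count and the middle third at one quarter, so the ratio falls well below 1. The figure this produces is a property of the sketch, not a result reported in the paper.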
Similar Papers
TPDiff: Temporal Pyramid Video Diffusion Model
CV and Pattern Recognition
Makes video creation faster and cheaper.
From Structure to Detail: Hierarchical Distillation for Efficient Diffusion Model
CV and Pattern Recognition
Makes AI create detailed pictures much faster.
Decentralized Diffusion Models
CV and Pattern Recognition
Trains AI art models cheaper and faster.