Taming Diffusion Transformer for Real-Time Mobile Video Generation
By: Yushu Wu, Yanyu Li, Anil Kag, and others
Potential Business Impact:
Enables real-time, high-quality AI video generation directly on smartphones.
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones, and real-time generation is even more challenging. In this work, we propose a series of novel optimizations to significantly accelerate video generation and enable real-time performance on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a knowledge-distillation-guided (KD-guided), sensitivity-aware tri-level pruning strategy to shrink the model to fit mobile platforms while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined, these optimizations enable our model to achieve over 10 frames per second (FPS) generation on an iPhone 16 Pro Max, demonstrating the feasibility of real-time, high-quality video generation on mobile devices.
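Of the three optimizations, the adversarial step distillation is the easiest to illustrate in code. The sketch below shows what a single training update of that flavor could look like; everything in it (the `TinyDiT` stand-in, the linear noise schedule, the hinge GAN loss, and the loss weights) is an illustrative assumption, not the authors' implementation, which operates on a full video DiT over VAE-compressed latents.

```python
# Minimal sketch of adversarial step distillation for a diffusion model.
# All class names, the noise schedule, and hyperparameters are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDiT(nn.Module):
    """Stand-in for a (pruned) diffusion transformer over flattened latents."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(),
                                 nn.Linear(128, dim))

    def forward(self, x, t):
        # Condition on the timestep by simple concatenation (simplified).
        t_emb = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([x, t_emb], dim=-1))

def distill_step(student, teacher, disc, x0, opt_s, opt_d):
    """One update of adversarial step distillation: the student mimics the
    frozen teacher's noise prediction (distillation term) while a
    discriminator pushes its one-shot denoised latents toward real ones
    (adversarial term), so few sampling steps suffice at inference."""
    t = torch.randint(0, 1000, (x0.shape[0],))
    alpha = (1.0 - t.float() / 1000.0).unsqueeze(-1)       # toy linear schedule
    noise = torch.randn_like(x0)
    xt = alpha.sqrt() * x0 + (1.0 - alpha).sqrt() * noise  # forward diffusion

    with torch.no_grad():                                  # teacher is frozen
        target = teacher(xt, t)

    pred = student(xt, t)
    # Student's one-step estimate of the clean latent from its noise prediction.
    x0_hat = (xt - (1.0 - alpha).sqrt() * pred) / alpha.sqrt().clamp(min=1e-2)

    # Discriminator update (hinge loss): real latents vs. student outputs.
    loss_d = (F.relu(1.0 - disc(x0)).mean()
              + F.relu(1.0 + disc(x0_hat.detach())).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Student update: match the teacher plus fool the discriminator.
    loss_s = F.mse_loss(pred, target) - 0.1 * disc(x0_hat).mean()
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
    return loss_s.item(), loss_d.item()

# Toy usage: 64-dim latents standing in for VAE-compressed video frames.
student, teacher = TinyDiT(), TinyDiT()
disc = nn.Sequential(nn.Linear(64, 128), nn.SiLU(), nn.Linear(128, 1))
opt_s = torch.optim.Adam(student.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
distill_step(student, teacher, disc, torch.randn(8, 64), opt_s, opt_d)
```

At inference, a student trained this way would be sampled with only the paper's four steps; together with the compressed VAE latents and the pruned backbone, that is what makes the reported 10+ FPS on-device rate plausible.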
Similar Papers
PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling
CV and Pattern Recognition
Speeds up video generation by pipelining tasks and decoupling models.
Minute-Long Videos with Dual Parallelisms
CV and Pattern Recognition
Generates minute-long videos faster using two kinds of parallelism.
Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers
CV and Pattern Recognition
Quantizes and distills video diffusion transformers to run on small devices.