Taming Diffusion Transformer for Real-Time Mobile Video Generation

Published: July 17, 2025 | arXiv ID: 2507.13343v1

By: Yushu Wu, Yanyu Li, Anil Kag, and more

Potential Business Impact:

Enables real-time, high-quality video generation directly on smartphones, without offloading to servers.

Business Areas:
Autonomous Vehicles, Transportation

Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones, and real-time generation is even more challenging. In this work, we propose a series of novel optimizations to significantly accelerate video generation and enable real-time performance on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a KD-guided, sensitivity-aware tri-level pruning strategy to shrink the model size to suit mobile platforms while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined, these optimizations enable our model to achieve over 10 frames per second (FPS) generation on an iPhone 16 Pro Max, demonstrating the feasibility of real-time, high-quality video generation on mobile devices.
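The key to the paper's speedup is step distillation: a distilled model generates in only four denoising steps instead of the dozens a standard diffusion sampler needs. The toy sketch below illustrates what a four-step sampling loop looks like in principle; the `denoiser` callable, the `sigmas` schedule, and the re-noising rule are illustrative assumptions, not the authors' actual adversarially distilled DiT.

```python
import numpy as np

def four_step_sample(denoiser, latent_shape, sigmas=(1.0, 0.66, 0.33, 0.0), seed=0):
    """Minimal sketch of few-step diffusion sampling.

    `denoiser(x, sigma)` is a hypothetical distilled network that predicts
    the clean latent x0 from a noisy latent x at noise level sigma.
    `sigmas` holds four start levels plus a terminal 0.0, so the loop below
    runs exactly four model evaluations.
    """
    rng = np.random.default_rng(seed)
    # Start from pure noise at the highest noise level.
    x = rng.standard_normal(latent_shape) * sigmas[0]
    for i in range(len(sigmas) - 1):
        x0_pred = denoiser(x, sigmas[i])
        # Move toward the prediction, keeping noise scaled to the next
        # (lower) sigma; the final step lands on sigma = 0, i.e. x0_pred.
        x = x0_pred + sigmas[i + 1] * (x - x0_pred) / sigmas[i]
    return x
```

With a real distilled video model, each call to `denoiser` would be one DiT forward pass over the compressed VAE latents, so total latency is roughly four forward passes plus one VAE decode per clip.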

Page Count
13 pages

Category
Computer Science:
CV and Pattern Recognition