SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices
By: Dongting Hu, Aarush Gupta, Magzhan Gabidolla and more
Recent advances in diffusion transformers (DiTs) have set new standards in image generation, yet these models remain impractical for on-device deployment due to their high computational and memory costs. In this work, we present an efficient DiT framework tailored for mobile and edge devices that achieves transformer-level generation quality under strict resource constraints. Our design combines three key components. First, we propose a compact DiT architecture with an adaptive global-local sparse attention mechanism that balances global context modeling and local detail preservation. Second, we introduce an elastic training framework that jointly optimizes sub-DiTs of varying capacities within a unified supernetwork, allowing a single model to adjust dynamically for efficient inference across different hardware. Finally, we develop Knowledge-Guided Distribution Matching Distillation, a step-distillation pipeline that integrates the DMD objective with knowledge transfer from few-step teacher models, yielding high-fidelity, low-latency generation (e.g., in 4 steps) suitable for real-time on-device use. Together, these contributions enable scalable, efficient, and high-quality diffusion models for deployment on diverse hardware.
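The abstract does not say how sub-DiTs share parameters inside the supernetwork. As a rough illustration of the general idea only, the sketch below assumes one common weight-sharing scheme: a narrower sub-network reuses the leading hidden units of the full layer. The names `ElasticLinear`, `elastic_mlp`, and `active_params`, as well as the slicing rule itself, are hypothetical and are not taken from the paper.

```python
import random


class ElasticLinear:
    """Toy linear layer (no bias) whose sub-networks reuse the leading
    rows/columns of a single shared weight matrix. Hypothetical sketch."""

    def __init__(self, in_features, out_features, seed=0):
        rng = random.Random(seed)
        self.in_features = in_features
        self.out_features = out_features
        # weight[j][i] connects input unit i to output unit j.
        self.weight = [
            [rng.uniform(-1.0, 1.0) for _ in range(in_features)]
            for _ in range(out_features)
        ]

    def forward(self, x, out_width=None):
        width = self.out_features if out_width is None else out_width
        if not 0 < width <= self.out_features:
            raise ValueError("out_width must be in 1..out_features")
        if len(x) > self.in_features:
            raise ValueError("input is wider than the layer")
        # Only the first `width` rows are used; zip() truncates each row
        # to len(x), so a narrower input uses the leading columns.
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight[:width]]


def elastic_mlp(fc1, fc2, x, hidden_width):
    """Two-layer MLP evaluated at a chosen hidden width (assumed scheme)."""
    hidden = [max(0.0, h) for h in fc1.forward(x, out_width=hidden_width)]  # ReLU
    return fc2.forward(hidden)


def active_params(in_features, hidden_width, out_features):
    """Number of weights actually used at a given hidden width."""
    return in_features * hidden_width + hidden_width * out_features


fc1 = ElasticLinear(4, 8, seed=1)
fc2 = ElasticLinear(8, 3, seed=2)
x = [0.5, -1.0, 2.0, 0.25]

# The same stored weights serve a small and a large sub-network.
for width in (2, 4, 8):
    y = elastic_mlp(fc1, fc2, x, width)
    print(width, active_params(4, width, 3), len(y))
```

In a joint-training setup of this kind, one would typically sample a width per step and update the shared weights through whichever slice was active; how the paper schedules or weights its sub-DiTs is not stated in the abstract.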