FlashOmni: A Unified Sparse Attention Engine for Diffusion Transformers
By: Liang Qiao, Yue Dai, Yeqi Huang, and more
Potential Business Impact:
Makes AI image generators run much faster.
Multi-Modal Diffusion Transformers (DiTs) demonstrate exceptional capabilities in visual synthesis, yet their deployment remains constrained by substantial computational demands. To alleviate this bottleneck, many sparsity-based acceleration methods have been proposed. However, their diverse sparsity patterns often require customized kernels for high-performance inference, limiting universality. We propose FlashOmni, a unified sparse attention engine compatible with arbitrary DiT architectures. FlashOmni introduces flexible sparse symbols to standardize the representation of a wide range of sparsity strategies, such as feature caching and block-sparse skipping. This unified abstraction enables diverse sparse computations to be executed within a single attention kernel. In addition, FlashOmni designs optimized sparse GEMMs for attention blocks, leveraging the sparse symbols to eliminate redundant computations and further improve efficiency. Experiments demonstrate that FlashOmni delivers near-linear speedup that closely matches the sparsity ratio (1:1) in attention and GEMM-$Q$, and achieves 2.5$\times$-3.8$\times$ acceleration in GEMM-$O$ (peaking at about 87.5% of the theoretical limit). Combined with a multi-granularity sparsity strategy, it enables the Hunyuan model (33K) to achieve about 1.5$\times$ end-to-end acceleration without degrading visual quality.
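To make the "sparse symbol" idea concrete, the sketch below shows one plausible interpretation: a per-query-block symbol that selects between full attention, reuse of a cached output (feature caching), and attention restricted to selected key/value blocks (block-sparse skipping). This is not the FlashOmni API; the names (`sparse_symbol_attention`, `COMPUTE`/`CACHED`/`SPARSE`, `kv_block_mask`, `cache`) and the block size are assumptions introduced purely for illustration.

```python
# Hypothetical sketch (not the FlashOmni implementation): unify feature caching
# and block-sparse skipping under one per-block "sparse symbol" dispatch.
import torch
import torch.nn.functional as F

COMPUTE, CACHED, SPARSE = 0, 1, 2  # illustrative sparse-symbol values

def sparse_symbol_attention(q, k, v, symbols, kv_block_mask, cache, block=64):
    """q, k, v: [L, D]; symbols: list of ints, one per query block;
    kv_block_mask: [num_q_blocks, num_kv_blocks] bool; cache: [L, D] previously
    computed attention output (assumes L is a multiple of `block` and that a
    SPARSE block keeps at least one KV block)."""
    L, D = q.shape
    out = torch.empty_like(q)
    n_kv_blocks = k.shape[0] // block
    for qb, sym in enumerate(symbols):
        qs = slice(qb * block, (qb + 1) * block)
        if sym == CACHED:
            out[qs] = cache[qs]              # feature caching: reuse, skip compute
            continue
        if sym == SPARSE:
            keep = kv_block_mask[qb]         # block-sparse skipping: subset of KV blocks
        else:
            keep = torch.ones(n_kv_blocks, dtype=torch.bool)
        idx = torch.cat([torch.arange(b * block, (b + 1) * block)
                         for b in keep.nonzero().flatten()])
        scores = (q[qs] @ k[idx].T) / D ** 0.5
        out[qs] = F.softmax(scores, dim=-1) @ v[idx]
    return out
```

In the paper's engine this dispatch presumably happens inside a single fused attention kernel (and inside the sparse GEMM-$Q$/GEMM-$O$ stages) rather than in a Python loop; the sketch only illustrates the semantics of representing different sparsity strategies with one symbol table.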
Similar Papers
Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers
CV and Pattern Recognition
Makes AI create videos much faster.
Bidirectional Sparse Attention for Faster Video Diffusion Training
CV and Pattern Recognition
Makes video creation faster and cheaper.
EDiT: Efficient Diffusion Transformers with Linear Compressed Attention
CV and Pattern Recognition
Makes AI create better pictures faster.