RainFusion2.0: Temporal-Spatial Awareness and Hardware-Efficient Block-wise Sparse Attention
By: Aiyue Chen, Yaofu Liu, Junjian Huang, and more
In video and image generation tasks, Diffusion Transformer (DiT) models incur extremely high computational costs due to their attention mechanisms, which limits their practical applications. Furthermore, with hardware advancements, a wide range of devices beyond graphics processing units (GPUs), such as application-specific integrated circuits (ASICs), have been increasingly adopted for model inference. Sparse attention, which exploits the inherent sparsity of attention by skipping computation for insignificant tokens, is an effective way to mitigate these costs. However, existing sparse attention methods have two critical limitations: the overhead of sparse pattern prediction and the lack of hardware generality, as most are designed specifically for GPUs. To address these challenges, this study proposes RainFusion2.0, an online-adaptive, hardware-efficient, and low-overhead sparse attention mechanism that accelerates both video and image generative models and performs robustly across diverse hardware platforms. Its key technical insights are: (1) using block-wise mean values as representative tokens for sparse mask prediction; (2) applying spatiotemporal-aware token permutation; and (3) introducing a first-frame sink mechanism designed specifically for video generation. Experimental results show that RainFusion2.0 achieves 80% sparsity with an end-to-end speedup of 1.5~1.8x without compromising video quality. Moreover, RainFusion2.0 remains effective across various generative models and generalizes across diverse hardware platforms.
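To make the mask-prediction idea concrete, the following is a minimal sketch of predicting a block-wise sparse attention mask from block-mean "representative tokens", with an always-kept first-frame sink region. All names, shapes, the top-k keep rule, and the `predict_block_mask` helper are illustrative assumptions based on the abstract, not the paper's actual implementation.

```python
# Hypothetical sketch: block-wise sparse mask prediction from block means.
# Not RainFusion2.0's actual algorithm; shapes and the keep rule are assumptions.
import torch

def predict_block_mask(q, k, block_size=64, keep_ratio=0.2, sink_blocks=1):
    """q, k: [heads, seq_len, dim]. Returns a [heads, nb, nb] boolean mask over
    (query block, key block) pairs; True means "compute this block"."""
    h, s, d = q.shape
    nb = s // block_size  # assume seq_len is divisible by block_size for brevity

    # Block-wise mean pooling -> one representative token per block.
    q_rep = q[:, :nb * block_size].reshape(h, nb, block_size, d).mean(dim=2)
    k_rep = k[:, :nb * block_size].reshape(h, nb, block_size, d).mean(dim=2)

    # Coarse block-level attention scores computed from the representatives.
    scores = torch.einsum("hqd,hkd->hqk", q_rep, k_rep) / d ** 0.5

    # Keep only the top-scoring key blocks per query block (~1 - sparsity).
    k_keep = max(1, int(keep_ratio * nb))
    topk = scores.topk(k_keep, dim=-1).indices
    mask = torch.zeros(h, nb, nb)
    mask.scatter_(-1, topk, 1.0)
    mask = mask.bool()

    # "First-frame sink": always attend to the leading key blocks (assumed to
    # correspond to the first video frame's tokens after permutation).
    mask[..., :sink_blocks] = True
    return mask

# Usage: the mask would gate a block-wise attention kernel, skipping
# (query block, key block) pairs where the mask is False.
q = torch.randn(8, 1024, 64)
k = torch.randn(8, 1024, 64)
mask = predict_block_mask(q, k)
print(mask.shape, mask.float().mean())  # fraction of block pairs actually computed
```

Since the representatives are just per-block means, the prediction overhead grows with the number of blocks rather than the number of tokens, which is what keeps the mask prediction cheap relative to full attention.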
Similar Papers
RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy
CV and Pattern Recognition
Makes AI video creation much faster.
Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation
CV and Pattern Recognition
Generates long videos faster without losing quality.
Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers
CV and Pattern Recognition
Makes AI create videos much faster.