FEB-Cache: Frequency-Guided Exposure Bias Reduction for Enhancing Diffusion Transformer Caching
By: Zhen Zou, Feng Zhao
Potential Business Impact:
Makes AI image creation faster and better.
Diffusion Transformers (DiTs) have exhibited impressive generation capabilities but face great challenges due to their high computational complexity. To address this issue, various methods, notably feature caching, have been introduced. However, these approaches focus on aligning with the non-cached diffusion process without analyzing why caching damages the generation process. In this paper, we first confirm that caching greatly amplifies exposure bias, resulting in a decline in generation quality. Directly applying noise scaling to this issue is challenging, however, because the exposure bias is non-smooth. We find that this phenomenon stems from the mismatch between its frequency response characteristics and the simple caching of Attention and MLP. These two components exhibit distinct preferences for frequency signals, which suggests a caching strategy that treats Attention and MLP separately in order to better fit the exposure bias and reduce it. Based on this, we introduce FEB-Cache, a joint caching strategy that caches Attention and MLP according to a frequency-guided cache table and aligns with the diffusion process without exposure bias (which gives a higher performance ceiling). Our approach builds on a comprehensive analysis of the caching mechanism and offers a new perspective on leveraging caching to accelerate the diffusion process. Empirical results indicate that FEB-Cache improves model performance while also accelerating generation.
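To make the idea of caching Attention and MLP on separate schedules concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. The names `build_cache_table` and `CachedBlock`, the fixed refresh intervals, and the scalar stand-ins for the two sub-layers are all hypothetical. The abstract does not say how the frequency-guided cache table is derived, so the table here is simply a per-step pair of "recompute" flags.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum, standing in for a residual connection."""
    return [x + y for x, y in zip(a, b)]


def build_cache_table(
    num_steps: int, attn_interval: int, mlp_interval: int
) -> List[Tuple[bool, bool]]:
    """Hypothetical cache table: per step, (recompute_attn, recompute_mlp).

    Fixed intervals are an assumption made for illustration only; the
    paper describes a frequency-guided table instead.
    """
    return [
        (t % attn_interval == 0, t % mlp_interval == 0)
        for t in range(num_steps)
    ]


class CachedBlock:
    """Toy transformer block whose Attention and MLP outputs are cached separately."""

    def __init__(
        self,
        attn: Callable[[Vector], Vector],
        mlp: Callable[[Vector], Vector],
    ) -> None:
        self.attn = attn
        self.mlp = mlp
        self.attn_cache: Optional[Vector] = None
        self.mlp_cache: Optional[Vector] = None
        self.attn_calls = 0
        self.mlp_calls = 0

    def forward(self, x: Vector, recompute_attn: bool, recompute_mlp: bool) -> Vector:
        # Recompute Attention only when the table asks for it (or on first use).
        if recompute_attn or self.attn_cache is None:
            self.attn_cache = self.attn(x)
            self.attn_calls += 1
        x = add(x, self.attn_cache)
        # The MLP follows its own, independent refresh schedule.
        if recompute_mlp or self.mlp_cache is None:
            self.mlp_cache = self.mlp(x)
            self.mlp_calls += 1
        return add(x, self.mlp_cache)


def run(num_steps: int = 8, attn_interval: int = 4, mlp_interval: int = 2) -> Tuple[int, int]:
    """Run the toy block over all steps and report how often each part was computed."""
    block = CachedBlock(
        attn=lambda v: [0.1 * e for e in v],  # placeholder for real attention
        mlp=lambda v: [0.5 * e for e in v],   # placeholder for a real MLP
    )
    x: Vector = [1.0, 2.0]
    for recompute_attn, recompute_mlp in build_cache_table(
        num_steps, attn_interval, mlp_interval
    ):
        x = block.forward(x, recompute_attn, recompute_mlp)
    return block.attn_calls, block.mlp_calls
```

Calling `run()` with the defaults computes the placeholder attention on 2 of 8 steps and the placeholder MLP on 4 of 8, reusing cached outputs otherwise. The point of the sketch is only the structure: two caches and a table that decides, per step, which one to refresh.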
Similar Papers
FreqCa: Accelerating Diffusion Models via Frequency-Aware Caching
Machine Learning (CS)
Makes AI image creation faster and more memory-efficient.
BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching
CV and Pattern Recognition
Makes video creation much faster without losing quality.
Accelerating Diffusion Transformer-Based Text-to-Speech with Transformer Layer Caching
Audio and Speech Processing
Makes computer voices sound better, faster.