Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion
By: Zhanqiu Hu, Jian Meng, Yash Akhauri, and more
Potential Business Impact:
Makes AI talk faster without losing quality.
Diffusion language models (DLMs) offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling than autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. While they match the quality of similarly sized autoregressive (AR) models (e.g., Qwen2.5 7B, Llama3 8B), their iterative denoising requires multiple full-sequence forward passes, resulting in high computational cost and latency, particularly for long input prompts and long-context scenarios. Furthermore, parallel token generation introduces token-incoherence problems, and current sampling heuristics suffer significant quality drops as the number of denoising steps decreases. We address these limitations with two training-free techniques. First, we propose FreeCache, a key-value (KV) approximation caching technique that reuses stable KV projections across denoising steps, effectively reducing the computational cost of DLM inference. Second, we introduce Guided Diffusion, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking, dramatically reducing the total number of denoising iterations without sacrificing quality. We conduct extensive evaluations on open-source reasoning benchmarks, and our combined methods deliver up to a 34x end-to-end speedup without compromising accuracy. For the first time, diffusion language models achieve latency comparable to, and in some cases lower than, widely adopted autoregressive models. Our work paves the way for scaling diffusion language models to a broader range of applications across domains.
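The abstract does not include implementation details, so the following is only a minimal sketch of how the two ideas it describes could fit into a masked-diffusion decoding loop: reusing KV projections that stay stable across denoising steps (the FreeCache idea) and letting a lightweight AR model decide which masked positions to commit at each step (the Guided Diffusion idea). All model interfaces and helper names here (`dlm`, `ar_model`, `MASK_ID`, the `kv_cache` argument) are assumptions for illustration, not the authors' implementation.

```python
import torch

MASK_ID = 0  # hypothetical mask-token id


def guided_cached_decode(dlm, ar_model, prompt_ids, gen_len, steps, tokens_per_step):
    """Illustrative masked-diffusion decoding loop combining the two ideas from
    the abstract: (1) reuse approximate KV projections across denoising steps,
    and (2) use a small AR model to score which masked positions to unmask."""
    device = prompt_ids.device
    seq = torch.cat([prompt_ids,
                     torch.full((gen_len,), MASK_ID, device=device)])
    kv_cache = None  # approximate KV state reused across denoising steps

    for _ in range(steps):
        masked = (seq == MASK_ID).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            break

        # One full-sequence forward pass. The hypothetical `dlm` interface
        # returns logits plus KV projections and accepts a previously cached
        # KV state that it reuses for positions whose tokens have not changed.
        logits, kv_cache = dlm(seq.unsqueeze(0), kv_cache=kv_cache)
        logits = logits[0]                      # (seq_len, vocab)
        proposals = logits.argmax(dim=-1)       # DLM's proposed tokens

        # Guided unmasking: score each masked position's proposal with a
        # lightweight AR model on the current partially unmasked sequence,
        # then commit only the highest-scoring positions this step
        # (position alignment is simplified for illustration).
        ar_logits = ar_model(seq.unsqueeze(0))[0]
        ar_logprobs = ar_logits.log_softmax(dim=-1)
        scores = ar_logprobs[masked, proposals[masked]]
        top = scores.topk(min(tokens_per_step, masked.numel())).indices
        commit = masked[top]
        seq[commit] = proposals[commit]

    return seq[len(prompt_ids):]
```

In this sketch the number of denoising iterations is bounded by `steps`, and committing `tokens_per_step` positions per iteration is how the loop trades parallelism against the token-incoherence risk the abstract mentions.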
Similar Papers
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Computation and Language
Makes AI write words much faster and better.
dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching
Machine Learning (CS)
Makes AI text generators work much faster.
Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States
Computation and Language
Makes AI write faster and smarter.