dInfer: An Efficient Inference Framework for Diffusion Language Models
By: Yuxin Ma, Lun Du, Lanning Wei, and more
Potential Business Impact:
Makes AI generate text much faster without losing quality.
Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Although more and more open-source dLLMs are being released, their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, and KV-cache manager) and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on $8\times$ H800 GPUs. Compared to prior systems, dInfer delivers a $10\times$ speedup over Fast-dLLM while maintaining similar model performance. Even compared with Qwen2.5-3B, an AR model with a comparable number of activated parameters and comparable performance that is served by the latest, highly optimized vLLM inference engine, dInfer still delivers a $2$-$3\times$ speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.
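To make the four-component decomposition concrete, the sketch below mocks up how a model, diffusion iteration manager, decoding strategy, and KV-cache manager might cooperate in a denoising loop. Every class name, method signature, and the toy confidence-threshold rule is an illustrative assumption, not dInfer's actual API; see the linked repository for the real implementation.

```python
# Minimal sketch of the four-component pipeline described in the abstract.
# All names, signatures, and the toy denoising loop are assumptions for
# exposition only, NOT dInfer's actual API.
from dataclasses import dataclass, field


@dataclass
class KVCacheManager:
    """Stores reusable key/value states so stable positions are not recomputed."""
    cache: dict = field(default_factory=dict)

    def refresh(self, step: int, states: dict) -> None:
        self.cache.update(states)


class DecodingStrategy:
    """Decides which masked positions to commit at each denoising step."""

    def select(self, confidences: list[float], threshold: float = 0.9) -> list[int]:
        # Commit every position whose confidence clears the threshold;
        # committing several positions per step is what yields parallelism.
        return [i for i, p in enumerate(confidences) if p >= threshold]


class DiffusionIterationManager:
    """Runs denoising iterations until every position has been committed."""

    def __init__(self, model, strategy: DecodingStrategy, kv: KVCacheManager):
        self.model, self.strategy, self.kv = model, strategy, kv

    def generate(self, prompt: str, length: int) -> list[int]:
        committed: set[int] = set()
        step = 0
        while len(committed) < length:  # toy model below guarantees progress
            confidences = self.model(prompt, committed, length)
            committed.update(self.strategy.select(confidences))
            self.kv.refresh(step, {})  # placeholder: stash reusable states
            step += 1
        return sorted(committed)  # committed positions, not token ids


def toy_model(prompt: str, committed: set[int], length: int) -> list[float]:
    # Stand-in for the dLLM forward pass: committed positions get full
    # confidence; the rest get a confidence that clears the threshold.
    return [1.0 if i in committed else 0.95 for i in range(length)]


if __name__ == "__main__":
    manager = DiffusionIterationManager(toy_model, DecodingStrategy(), KVCacheManager())
    print(manager.generate("Hello", length=8))
```

Isolating each concern behind a small interface is presumably what lets the framework swap in a new decoding strategy or cache policy without touching the iteration loop or the model itself.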
Similar Papers
Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
Machine Learning (CS)
Makes AI write much faster than before.
EDIT: Early Diffusion Inference Termination for dLLMs Based on Dynamics of Training Gradients
Artificial Intelligence
Makes AI think faster and use less energy.
A Survey on Diffusion Language Models
Computation and Language
Makes computers write faster and understand better.