Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration
By: Jucheng Shen, Gaurav Sarkar, Yeonju Ro, and more
Potential Business Impact:
Makes AI text generation much faster, with no extra training needed.
We present CadLLM, a training-free method to improve the inference throughput of diffusion-based LLMs (dLLMs). We first observe that token-unmasking confidence varies substantially across generation blocks and denoising steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and unmasking threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
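The abstract names two mechanisms: confidence-driven adaptation of block size, step size, and unmasking threshold, and a softmax restricted to a vocabulary subset. Below is a minimal sketch of those two ideas, assuming a PyTorch-style decoding loop; the function names, bounds, and update rules are illustrative assumptions, not the paper's actual algorithm.

```python
# Sketch only: confidence-aware calibration for a dLLM decoding loop.
# All names and constants are hypothetical; the paper's rules may differ.
import torch

def calibrate(avg_confidence, block_size, step_size, threshold,
              conf_hi=0.9, conf_lo=0.6,
              block_bounds=(16, 128), step_bounds=(1, 8),
              thresh_bounds=(0.5, 0.95)):
    """Adjust decoding hyperparameters from the mean confidence of the
    tokens unmasked in the previous step (assumed update rule)."""
    if avg_confidence > conf_hi:
        # Confident: decode more aggressively (larger blocks, bigger steps,
        # lower acceptance threshold).
        block_size = min(block_size * 2, block_bounds[1])
        step_size = min(step_size + 1, step_bounds[1])
        threshold = max(threshold - 0.05, thresh_bounds[0])
    elif avg_confidence < conf_lo:
        # Uncertain: decode more conservatively.
        block_size = max(block_size // 2, block_bounds[0])
        step_size = max(step_size - 1, step_bounds[0])
        threshold = min(threshold + 0.05, thresh_bounds[1])
    return block_size, step_size, threshold

def topk_softmax(logits, k=1024):
    """Normalize only over the top-k vocabulary entries instead of the full
    vocabulary, cutting softmax cost while regulating sampling breadth."""
    vals, idx = torch.topk(logits, k, dim=-1)
    probs = torch.softmax(vals, dim=-1)
    return probs, idx
```

As a usage pattern, the decoder would call calibrate once per denoising step with the average confidence of the tokens it just unmasked, then sample the next tokens from topk_softmax over the current block's logits.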
Similar Papers
CDLM: Consistency Diffusion Language Models For Faster Sampling
Machine Learning (CS)
Makes AI write and code much faster.
dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching
Machine Learning (CS)
Makes AI text generators work much faster.
CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credits
Computation and Language
Makes AI write faster and better.