Diffusion Language Models are Provably Optimal Parallel Samplers
By: Haozhe Jiang, Nika Haghtalab, Lijie Chen
Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive models, offering faster inference via parallel token generation. We provide a rigorous foundation for this advantage by formalizing a model of parallel sampling and showing that DLMs augmented with polynomial-length chain-of-thought (CoT) can simulate any parallel sampling algorithm using an optimal number of sequential steps. Consequently, whenever a target distribution can be generated in a small number of sequential steps, a DLM can generate that distribution in the same, optimal, number of steps. However, without the ability to modify previously revealed tokens, DLMs with CoT can still incur a large intermediate footprint. We prove that enabling remasking (converting unmasked tokens back to masks) or revision (converting unmasked tokens to other unmasked tokens), together with CoT, further allows DLMs to simulate any parallel sampling algorithm with optimal space complexity. We further justify the advantage of revision by establishing a strict expressivity gap: DLMs with revision or remasking are strictly more expressive than those without. Our results not only provide a theoretical justification for the promise of DLMs as optimal parallel samplers, but also advocate for enabling revision in DLMs.
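To make the operations named in the abstract concrete, here is a minimal toy sketch (not the paper's construction) of a masked-diffusion-style decoding loop. It contrasts plain parallel unmasking with the "remasking" and "revision" operations; the `propose` function is a hypothetical stand-in for a trained DLM's per-position predictions, and the probabilities are arbitrary illustrative choices.

```python
import random

MASK = "<mask>"
VOCAB = ["a", "b", "c"]

def propose(seq):
    """Hypothetical model call: return a proposed token for every position."""
    return [random.choice(VOCAB) for _ in seq]

def step(seq, allow_remask=False, allow_revise=False):
    """One sequential step: all positions are updated in parallel."""
    proposals = propose(seq)
    new_seq = list(seq)
    for i, tok in enumerate(seq):
        if tok == MASK:
            # Standard DLM step: reveal some masked positions in parallel.
            if random.random() < 0.5:
                new_seq[i] = proposals[i]
        elif allow_revise and random.random() < 0.1:
            # Revision: an unmasked token may be rewritten to another unmasked token.
            new_seq[i] = proposals[i]
        elif allow_remask and random.random() < 0.1:
            # Remasking: an unmasked token may be converted back to a mask.
            new_seq[i] = MASK
    return new_seq

seq = [MASK] * 8
for _ in range(5):  # a small, fixed number of sequential steps
    seq = step(seq, allow_revise=True)
print(seq)
```

The point of the sketch is only to show where remasking and revision act: in each sequential step every position can change simultaneously, and without these operations a revealed token is frozen for the rest of sampling.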