Reasoning in Diffusion Large Language Models is Concentrated in Dynamic Confusion Zones
By: Ranfei Chen, Ming Chen, Kaifei Wang
Potential Business Impact:
Teaches AI to learn better by focusing on tricky parts.
Diffusion Large Language Models (dLLMs) are rapidly emerging alongside autoregressive models as a powerful paradigm for complex reasoning, with reinforcement learning increasingly used for downstream alignment. Existing trajectory-based RL methods uniformly allocate policy gradients across denoising steps, implicitly treating all steps as equally important. We challenge this assumption by analyzing trajectories with several step-level metrics: entropy-based uncertainty, Confidence-Margin (CM) uncertainty, and Rate of Entropy Change (RoEC). These reveal structured "zones of confusion": transient spikes in uncertainty and instability that strongly predict final success or failure, while most steps remain stable. We propose Adaptive Trajectory Policy Optimization (ATPO), a lightweight step-selection strategy that dynamically reallocates gradient updates to these high-leverage steps without changing the RL objective, rewards, or compute budget. Using a hybrid RoEC+CM rule, ATPO delivers substantial gains in reasoning accuracy and training stability across benchmarks, showing that exploiting trajectory dynamics is key to advancing dLLM RL.
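To make the step-selection idea concrete, below is a minimal sketch of the kind of step-level metrics the abstract names (entropy, Confidence-Margin, Rate of Entropy Change) and a hybrid RoEC+CM rule for picking high-leverage denoising steps. This is not the paper's implementation: the function names, shapes, score combination, and top-k budget are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): step-level uncertainty metrics
# over a dLLM denoising trajectory and a hybrid RoEC+CM step-selection rule.
# Array shapes, the scoring formula, and the top-k budget are assumptions.
import numpy as np

def step_entropy(probs: np.ndarray) -> float:
    """Mean token entropy at one denoising step; probs has shape (tokens, vocab)."""
    return float(-(probs * np.log(probs + 1e-12)).sum(axis=-1).mean())

def confidence_margin(probs: np.ndarray) -> float:
    """Mean gap between top-1 and top-2 token probabilities (small gap = uncertain)."""
    top2 = np.sort(probs, axis=-1)[:, -2:]           # (tokens, 2): second-best, best
    return float((top2[:, 1] - top2[:, 0]).mean())

def select_steps(trajectory_probs: list[np.ndarray], k: int) -> list[int]:
    """Pick k high-leverage steps by combining entropy change (RoEC) with CM."""
    ent = np.array([step_entropy(p) for p in trajectory_probs])
    roec = np.abs(np.diff(ent, prepend=ent[0]))       # rate of entropy change per step
    cm = np.array([confidence_margin(p) for p in trajectory_probs])
    # High RoEC (instability) and low CM (low confidence) mark "zones of confusion";
    # the normalized sum below is just one plausible way to combine them.
    score = roec / (roec.max() + 1e-12) + (1.0 - cm / (cm.max() + 1e-12))
    return np.argsort(-score)[:k].tolist()

# Example: a 16-step trajectory over 8 tokens with a 50-word vocabulary;
# reallocate gradient updates to the 4 highest-scoring steps.
rng = np.random.default_rng(0)
traj = [rng.dirichlet(np.ones(50), size=8) for _ in range(16)]
print(select_steps(traj, k=4))
```

In this reading, the RL objective and rewards stay untouched; only which denoising steps receive gradient updates changes, which matches the abstract's claim of no extra compute budget.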
Similar Papers
Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective
Computation and Language
Teaches AI to write better by learning from mistakes.
Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization
Machine Learning (CS)
Teaches AI to solve math and code problems better.
Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models
Machine Learning (CS)
Makes AI better at math, code, and planning.