Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective
By: Jingyang Ou, Jiaqi Han, Minkai Xu, and more
Potential Business Impact:
Teaches AI to solve math, coding, and planning problems better by learning from feedback on whole answers.
Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs. Our code is available at https://github.com/ML-GSAI/ESPO.
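The abstract names three ingredients: a sequence-level importance ratio built on the ELBO (since the true sequence likelihood of a dLLM is intractable), per-token normalization of that ratio, and a robust KL-divergence estimate. Below is a minimal PyTorch sketch of how these pieces might fit together in a GRPO-style clipped objective. The function name `espo_loss`, the normalization via division by response length, and the non-negative k3 KL estimator are illustrative assumptions, not the paper's confirmed implementation.

```python
import torch

def espo_loss(elbo_new, elbo_old, elbo_ref, advantages, seq_lens,
              clip_eps=0.2, kl_coef=0.01):
    """Sketch of a sequence-level PPO-style objective where the ELBO
    stands in for the intractable sequence log-likelihood of a dLLM.

    elbo_new / elbo_old / elbo_ref: per-sequence ELBO estimates under
        the current, behavior, and reference policies, shape (batch,).
    advantages: group-normalized rewards, shape (batch,).
    seq_lens: response lengths in tokens, shape (batch,).
    """
    # Per-token normalization of the sequence-level importance ratio:
    # exp((ELBO_new - ELBO_old) / |y|) keeps the ratio's scale
    # independent of response length (an assumption about what
    # "per-token normalization" means here).
    log_ratio = (elbo_new - elbo_old) / seq_lens
    ratio = log_ratio.exp()

    # Clipped surrogate as in PPO/GRPO, applied once per sequence:
    # the whole generation is treated as a single action.
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.minimum(unclipped, clipped).mean()

    # "Robust" KL estimate: the non-negative k3 estimator
    # KL ~ exp(d) - d - 1 with d = (ELBO_ref - ELBO_new) / |y|.
    # One common low-variance choice; the paper's exact estimator
    # may differ.
    delta = (elbo_ref - elbo_new) / seq_lens
    kl_est = (delta.exp() - delta - 1.0).mean()

    return policy_loss + kl_coef * kl_est
```

Treating the ELBO difference as the log importance ratio is what makes the objective sequence-level: no token-level conditional probabilities are ever needed, which is exactly the factorization dLLMs lack.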
Similar Papers
Reasoning in Diffusion Large Language Models is Concentrated in Dynamic Confusion Zones
Machine Learning (CS)
Teaches AI to learn better by focusing on tricky parts.
Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization
Machine Learning (CS)
Teaches AI to solve math and code problems better.
Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models
Machine Learning (CS)
Makes AI better at math, code, and planning.