d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models
By: Leyi Pan, Shuchang Tao, Yunpeng Zhai, and more
Potential Business Impact:
Helps AI solve math and logic puzzles better.
Reliable reinforcement learning (RL) for diffusion large language models (dLLMs) requires both accurate advantage estimation and precise estimation of prediction probabilities. Existing RL methods for dLLMs fall short in both aspects: they rely on coarse or unverifiable reward signals, and they estimate prediction probabilities without accounting for the bias relative to the true, unbiased expected prediction probability that properly integrates over all possible decoding orders. To mitigate these issues, we propose d-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards to provide fine-grained and verifiable step-wise reward signals. When estimating the conditional transition probability from a parent node to a child node, we theoretically analyze the estimation error between the unbiased expected prediction probability and the estimate obtained via a single forward pass, and find that higher prediction confidence leads to lower estimation error. Guided by this analysis, we introduce a time-scheduled self-distillation loss during training that enhances prediction confidence in later training stages, thereby enabling more accurate probability estimation and improved convergence. Experiments show that d-TreeRPO outperforms existing baselines and achieves significant gains on multiple reasoning benchmarks, including +86.2 on Sudoku, +51.6 on Countdown, +4.5 on GSM8K, and +5.3 on Math500. Ablation studies and computational cost analyses further demonstrate the effectiveness and practicality of our design choices.
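The abstract's core mechanism, bottom-up advantage computation over tree-structured rollouts with verifiable outcome rewards, can be pictured with a small sketch. The code below is not the authors' implementation: the names (TreeNode, backup_values, assign_advantages) and the choice to average child values and subtract the parent's value are illustrative assumptions about how leaf-level verifiable rewards might be turned into step-wise advantage signals.

```python
# Minimal sketch (assumed, not the paper's code) of bottom-up advantage
# computation on a rollout tree. Leaves hold verifiable outcome rewards
# (e.g., 1.0 if the final answer passes a checker, 0.0 otherwise); each
# internal node's value is the mean of its children's values, and a node's
# advantage is its value minus its parent's value.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TreeNode:
    children: List["TreeNode"] = field(default_factory=list)
    reward: Optional[float] = None   # verifiable outcome reward, set on leaves only
    value: float = 0.0               # filled in by the bottom-up pass
    advantage: float = 0.0           # filled in by the top-down pass


def backup_values(node: TreeNode) -> float:
    """Bottom-up pass: a leaf's value is its verified reward; an internal
    node's value is the average of its children's values."""
    if not node.children:
        node.value = float(node.reward)
        return node.value
    node.value = sum(backup_values(c) for c in node.children) / len(node.children)
    return node.value


def assign_advantages(node: TreeNode, parent_value: Optional[float] = None) -> None:
    """Top-down pass: each node's step-wise advantage is its value
    relative to its parent's value (the root gets 0)."""
    node.advantage = 0.0 if parent_value is None else node.value - parent_value
    for child in node.children:
        assign_advantages(child, node.value)


if __name__ == "__main__":
    # Toy tree: one branch of partial decodings tends to reach a verified
    # answer, the other mostly does not.
    good = TreeNode(children=[TreeNode(reward=1.0), TreeNode(reward=1.0)])
    bad = TreeNode(children=[TreeNode(reward=0.0), TreeNode(reward=1.0)])
    root = TreeNode(children=[good, bad])

    backup_values(root)          # root.value = 0.75
    assign_advantages(root)
    print(good.advantage, bad.advantage)  # +0.25 and -0.25 relative to the root
```

In this toy run the branch that reliably reaches a verified answer receives a positive step-wise advantage and the unreliable branch a negative one, which is the kind of fine-grained signal the abstract attributes to the tree rollout; the paper's actual transition-probability estimation and time-scheduled self-distillation loss are not modeled here.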
Similar Papers
Reasoning in Diffusion Large Language Models is Concentrated in Dynamic Confusion Zones
Machine Learning (CS)
Teaches AI to learn better by focusing on tricky parts.
Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective
Computation and Language
Teaches AI to write better by learning from mistakes.
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling
Machine Learning (CS)
Makes AI smarter and faster at thinking.