TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling
By: Yizhi Li, Qingshui Gu, Zhoufutu Wen, and more
Potential Business Impact:
Makes AI smarter and faster at thinking.
Recent advances in aligning large language models via reinforcement learning have achieved remarkable gains on complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, a self-guided rollout algorithm that views sequence generation as a tree-structured search process. Composed of a dynamic tree-sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO substantially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV-cache burden through contiguous segments and spawns new branches alongside an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) an analysis of the effectiveness of probability- and quality-driven dynamic divergence and the fallback strategy. We empirically validate the performance gains of TreePO on a set of reasoning benchmarks, along with GPU-hour savings of 22% up to 43% from the sampling design for the trained models, and show up to a 40% reduction in trajectory-level and 35% in token-level sampling compute for existing models. While offering a free lunch of inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The project home page is at https://m-a-p.ai/TreePO.
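To make the rollout procedure concrete, below is a minimal Python sketch of segment-wise tree sampling in the spirit the abstract describes: decode fixed-length segments, branch where local uncertainty is high, stop early on completion, and prune low-value paths. All names and thresholds (sample_segment, segment_entropy, BRANCH_THRESHOLD, PRUNE_THRESHOLD) are hypothetical stand-ins for illustration, not the paper's actual implementation.

import math
import random

SEGMENT_LEN = 8        # fixed-length decoding segments (assumed hyperparameter)
BRANCH_THRESHOLD = 2.0 # spawn a new branch when local uncertainty exceeds this
PRUNE_THRESHOLD = 0.1  # drop branches whose value proxy falls below this
MAX_BRANCHES = 4       # budget on concurrently live branches

def sample_segment(prefix):
    """Stand-in for decoding SEGMENT_LEN tokens from the policy.

    Returns (tokens, per_token_probs, finished). A real system would call
    the language model here and share the KV cache across branches that
    have a common prefix, which is where the compute amortization comes from.
    """
    tokens = [random.randint(0, 99) for _ in range(SEGMENT_LEN)]
    probs = [random.uniform(0.05, 1.0) for _ in tokens]
    finished = random.random() < 0.2  # emulate hitting an end-of-sequence token
    return tokens, probs, finished

def segment_entropy(probs):
    """Local uncertainty proxy: mean negative log-probability over the segment."""
    return -sum(math.log(p) for p in probs) / len(probs)

def tree_rollout(prompt_tokens, max_segments=16):
    """Grow a rollout tree segment by segment; return completed trajectories."""
    live = [(list(prompt_tokens), 0.0)]  # (token prefix, cumulative log-prob)
    done = []
    for _ in range(max_segments):
        next_live = []
        for prefix, logp in live:
            tokens, probs, finished = sample_segment(prefix)
            new_prefix = prefix + tokens
            new_logp = logp + sum(math.log(p) for p in probs)
            if finished:
                done.append((new_prefix, new_logp))  # early stop on completion
                continue
            next_live.append((new_prefix, new_logp))
            # Spawn an extra branch where local uncertainty is high; siblings
            # reuse the shared prefix, so the marginal cost is low.
            if segment_entropy(probs) > BRANCH_THRESHOLD and len(next_live) < MAX_BRANCHES:
                next_live.append((list(new_prefix), new_logp))
        # Prune low-value paths early (value proxy: per-token geometric-mean prob).
        next_live = [b for b in next_live
                     if math.exp(b[1] / max(1, len(b[0]))) > PRUNE_THRESHOLD]
        live = sorted(next_live, key=lambda b: b[1], reverse=True)[:MAX_BRANCHES]
        if not live:
            break
    return done + live

if __name__ == "__main__":
    trajectories = tree_rollout(prompt_tokens=[1, 2, 3])
    print(f"collected {len(trajectories)} trajectories")

In a real training loop, the completed trajectories would then be scored by a reward function and fed into the policy update; the sketch only covers the sampling side.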
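The abstract also mentions a tree-based segment-level advantage estimation combining global and local views. One plausible reading, not taken from the paper, baselines each trajectory's reward both against all rollouts of the prompt (global, GRPO-style group baseline) and against the siblings sharing the segment's prefix node (local, subtree baseline), mixed by an assumed weight alpha:

\[
A(s_v) \;=\; \alpha \bigl(R(\tau) - \bar{R}_{\text{all}}\bigr) \;+\; (1-\alpha)\bigl(R(\tau) - \bar{R}_{\text{subtree}(v)}\bigr)
\]

Here \(R(\tau)\) is the reward of the trajectory containing segment \(s_v\), \(\bar{R}_{\text{all}}\) is the mean reward over all trajectories sampled for the prompt, \(\bar{R}_{\text{subtree}(v)}\) is the mean over trajectories passing through node \(v\), and \(\alpha\) is a hypothetical mixing weight; the paper's actual estimator may differ.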
Similar Papers
Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning
Artificial Intelligence
Teaches computers to learn better from choices.
d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models
Computation and Language
Helps AI solve math and logic puzzles better.
OptPO: Optimal Rollout Allocation for Test-time Policy Optimization
Machine Learning (CS)
Makes AI smarter by learning from its own mistakes.