Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
By: Junbo Li, Peng Zhou, Rui Meng, and more
Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation, as opposed to the commonly used token-level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.
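The abstract does not spell out the turn-PPO update itself, so the following is only a minimal sketch of how a turn-level MDP formulation could be wired up: a critic scores each turn boundary, GAE is computed over turns rather than tokens, and every token generated within a turn shares that turn's advantage in the clipped PPO objective. The `Turn` container and the `turn_level_gae` and `ppo_policy_loss` helpers are illustrative assumptions, not code from the paper.

```python
# Hypothetical sketch of turn-level advantage estimation for PPO.
# Each agent turn is treated as one MDP step; the turn advantage is
# broadcast to all tokens emitted in that turn for the policy update.
from dataclasses import dataclass
from typing import List
import torch


@dataclass
class Turn:
    token_logprobs: torch.Tensor  # log-probs of tokens emitted this turn, shape (T_i,)
    value: float                  # critic estimate V(s_i) at the start of the turn
    reward: float                 # environment reward observed after the turn


def turn_level_gae(turns: List[Turn], gamma: float = 1.0, lam: float = 0.95):
    """Generalized Advantage Estimation computed over turns instead of tokens."""
    advantages = [0.0] * len(turns)
    gae = 0.0
    for i in reversed(range(len(turns))):
        next_value = turns[i + 1].value if i + 1 < len(turns) else 0.0
        delta = turns[i].reward + gamma * next_value - turns[i].value
        gae = delta + gamma * lam * gae
        advantages[i] = gae
    return advantages


def ppo_policy_loss(turns, old_turns, advantages, clip_eps: float = 0.2):
    """Clipped PPO surrogate; each token in a turn shares that turn's advantage."""
    losses = []
    for turn, old_turn, adv in zip(turns, old_turns, advantages):
        ratio = torch.exp(turn.token_logprobs - old_turn.token_logprobs.detach())
        adv_t = torch.full_like(ratio, adv)  # broadcast turn-level advantage to tokens
        unclipped = ratio * adv_t
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv_t
        losses.append(-torch.min(unclipped, clipped).mean())
    return torch.stack(losses).mean()
```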
Similar Papers
Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization
Machine Learning (CS)
Introduces Group Turn Policy Optimization for multi-turn tool-integrated reasoning.
ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents Training
Machine Learning (CS)
Proposes a stabilized off-policy PPO variant for training multi-turn agents.
Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative PPO
Machine Learning (CS)
Uses iterative PPO to align LLMs toward multi-turn conversational outcomes.