
Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

Published: October 16, 2025 | arXiv ID: 2510.14967v1

By: Guoqing Wang, Sunhao Dai, Guangze Ye and more

Potential Business Impact:

Teaches AI to learn better from each step.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.
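The turn-level reward described in the abstract can be illustrated with a minimal sketch. It assumes we already have, for one rollout, the policy's probability of producing the correct answer before the first turn and after each turn. The function names, the input format, and in particular the choice to add the outcome reward to the final turn are assumptions made for illustration; the abstract does not specify how IGPO combines the two signals.

```python
from typing import List


def information_gain_rewards(answer_probs: List[float]) -> List[float]:
    """Turn-level rewards as marginal increases in P(correct answer).

    answer_probs[0] is the probability before any turn;
    answer_probs[t] is the probability after turn t (hypothetical input format).
    """
    if len(answer_probs) < 2:
        raise ValueError("need the initial probability and at least one turn")
    return [
        answer_probs[t] - answer_probs[t - 1]
        for t in range(1, len(answer_probs))
    ]


def dense_reward_trajectory(
    answer_probs: List[float], outcome_reward: float
) -> List[float]:
    """Combine intrinsic turn-level rewards with an outcome-level reward.

    Assumption: the outcome reward is simply added to the last turn.
    The paper may combine the two signals differently.
    """
    rewards = information_gain_rewards(answer_probs)
    rewards[-1] += outcome_reward
    return rewards


if __name__ == "__main__":
    # Made-up probabilities for a three-turn rollout.
    probs = [0.10, 0.30, 0.25, 0.70]
    print(information_gain_rewards(probs))
    print(dense_reward_trajectory(probs, outcome_reward=1.0))
```

Under these assumptions the turn-level rewards sum to the total change in the answer probability, and a turn that lowers the probability (the second turn in the made-up example) receives a negative reward. Rollouts with identical outcome rewards can therefore still produce different reward trajectories.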

Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization

Machine Learning (CS)

Teaches AI to solve math problems step-by-step.

18 Nov 2025

90%

Group-in-Group Policy Optimization for LLM Agent Training

Machine Learning (CS)

Helps AI agents learn better from many steps.

16 May 2025

90%

Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning

Machine Learning (CS)

Teaches AI to plan long tasks better, faster.

24 Sep 2025


Repos / Data Links

github.com

Page Count

20 pages
