Graph-Enhanced Policy Optimization in LLM Agent Training
By: Jiazhen Yuan, Wei Zhao, Zhengbiao Bai
Potential Business Impact:
Teaches AI to learn better by seeing connections.
Group-based reinforcement learning (RL) has shown impressive results on complex reasoning and mathematical tasks. Yet, when applied to training multi-turn, interactive LLM agents, these methods often suffer from structural blindness: the inability to exploit the underlying connectivity of the environment. This manifests in three critical challenges: (1) inefficient, unguided exploration; (2) imprecise credit assignment that overlooks pivotal states; and (3) myopic planning caused by static reward discounting. We address these issues with Graph-Enhanced Policy Optimization (GEPO), which dynamically constructs a state-transition graph from agent experience and employs graph-theoretic centrality to provide three synergistic learning signals: (1) structured intrinsic rewards that guide exploration toward high-impact states, (2) a graph-enhanced advantage function for topology-aware credit assignment, and (3) a dynamic discount factor adapted to each state's strategic value. On ALFWorld, WebShop, and a proprietary Workbench benchmark, GEPO demonstrates strong performance, achieving absolute success-rate gains of +4.1%, +5.3%, and +10.9% over competitive baselines. These results highlight that explicitly modeling environmental structure is a robust, generalizable strategy for advancing LLM agent training.
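To make the three signals concrete, here is a minimal sketch of how a state-transition graph could yield a centrality-shaped intrinsic reward, a dynamic discount, and a topology-aware advantage. This is an illustration under stated assumptions, not the authors' implementation: the use of networkx, the choice of betweenness centrality as the "impact" measure, and every function name and hyperparameter below are assumptions for exposition.

```python
# Illustrative sketch of GEPO-style graph signals (not the paper's code).
# Assumes states are hashable and betweenness centrality proxies "impact".
import networkx as nx

def build_transition_graph(trajectories):
    """Accumulate observed (state, next_state) transitions into a digraph."""
    G = nx.DiGraph()
    for traj in trajectories:
        for s, s_next in zip(traj, traj[1:]):
            G.add_edge(s, s_next)
    return G

def centrality_scores(G):
    """Graph-theoretic centrality as a proxy for each state's strategic value.
    networkx normalizes betweenness to [0, 1] by default."""
    return nx.betweenness_centrality(G)

def intrinsic_reward(state, scores, beta=0.1):
    """Signal (1): shaped bonus steering exploration toward high-impact states."""
    return beta * scores.get(state, 0.0)

def dynamic_discount(state, scores, gamma_min=0.9, gamma_max=0.99):
    """Signal (3): discount adapted to centrality, so returns propagated
    through pivotal states decay more slowly than through peripheral ones."""
    c = scores.get(state, 0.0)
    return gamma_min + (gamma_max - gamma_min) * c

def graph_enhanced_advantage(traj_rewards, traj_states, scores, baseline):
    """Signal (2): topology-aware credit assignment. The discounted return
    uses the per-state dynamic gamma and the centrality bonus; the advantage
    is the return minus a group baseline, as in group-based RL methods."""
    G_t, returns = 0.0, []
    for r, s in zip(reversed(traj_rewards), reversed(traj_states)):
        G_t = r + intrinsic_reward(s, scores) + dynamic_discount(s, scores) * G_t
        returns.append(G_t)
    returns.reverse()
    return [g - baseline for g in returns]
```

In this sketch a single centrality score does triple duty: it shapes the reward, stretches the effective planning horizon at pivotal states, and enters credit assignment through the discounted return. The paper may weight or combine these signals differently.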
Similar Papers
Group-in-Group Policy Optimization for LLM Agent Training
Machine Learning (CS)
Helps AI agents learn better from many steps.
GEPO: Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning
Machine Learning (CS)
Trains smart computer programs across far-apart machines.
GRAPH-GRPO-LEX: Contract Graph Modeling and Reinforcement Learning with Group Relative Policy Optimization
Artificial Intelligence
Makes reading legal papers faster and easier.