AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress
By: Zhiheng Xi, Chenyang Liao, Guanyu Li, and more
Potential Business Impact:
Helps AI make better choices step-by-step.
Despite rapid development, large language models (LLMs) still encounter challenges in multi-turn decision-making tasks (i.e., agent tasks) like web shopping and browser navigation, which require making a sequence of intelligent decisions based on environmental feedback. Previous work on LLM agents typically relies on elaborate prompt engineering or fine-tuning with expert trajectories to improve performance. In this work, we take a different perspective: we explore constructing process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. Unlike LLM reasoning, where each step is scored for correctness, actions in agent tasks have no clear-cut correctness; instead, they should be evaluated by their proximity to the goal and the progress they make. Building on this insight, we propose a re-defined PRM for agent tasks, named AgentPRM, to capture both the interdependence between sequential decisions and their contribution to the final goal. This enables better progress tracking and a better exploration-exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a Temporal Difference-based (TD-based) estimation method combined with Generalized Advantage Estimation (GAE), which proves more sample-efficient than prior methods. Extensive experiments across different agentic tasks show that AgentPRM is over 8× more compute-efficient than baselines and demonstrates robust improvement when scaling up test-time compute. Moreover, we perform detailed analyses to show how our method works and offer further insights, e.g., applying AgentPRM to the reinforcement learning of LLM agents.
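To make the labeling step concrete, below is a minimal sketch of how per-step training targets could be derived from TD residuals combined with GAE, as the abstract describes. The function name `gae_step_labels`, the trajectory values, and the `gamma`/`lam` settings are illustrative assumptions, not the paper's actual implementation or reported hyperparameters.

```python
import numpy as np

def gae_step_labels(rewards, values, gamma=0.99, lam=0.95):
    """Compute per-step advantage labels via TD residuals + GAE.

    rewards: per-step rewards for one trajectory (often sparse in agent
             tasks, e.g. only the final step reflects task success).
    values:  value estimates V(s_0..s_{T-1}) plus one terminal bootstrap
             value appended at the end (len(values) == len(rewards) + 1).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Sweep backward so credit from later steps flows to earlier ones.
    for t in reversed(range(T)):
        # TD residual: how much better the step turned out than expected.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD residuals.
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example: a 4-step trajectory where only the final action is rewarded
# (the agent completes the task), with rough value estimates per state.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.2, 0.3, 0.5, 0.8, 0.0]  # V(s_0..s_3) plus terminal bootstrap
print(gae_step_labels(rewards, values))
```

Computing the labels backward lets credit from a sparse final reward propagate to earlier decisions, which matches the abstract's point that agent actions should be scored by their progress toward the goal rather than by per-step correctness.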
Similar Papers
A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models
Computation and Language
Teaches computers to think step-by-step.
The Bidirectional Process Reward Model
Computation and Language
Helps AI check its thinking both ways.
GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning
Computation and Language
Fixes math problems by explaining each step.