Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space
By: Joey Hong , Kang Liu , Zhan Ling and more
Potential Business Impact:
Teaches computers to learn better by explaining mistakes.
Large language model (LLM) agents -- LLMs that dynamically interact with an environment over long horizons -- have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.
Similar Papers
Enhancing Decision-Making of Large Language Models via Actor-Critic
Computation and Language
Helps computers make better, long-term choices.
Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
Computation and Language
Teaches AI to learn and solve problems better.
Learning Robust Social Strategies with Large Language Models
Machine Learning (CS)
Teaches AI to work together, not cheat.