Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs
By: Yujie Zhao, Lanxiang Hu, Yang Wang, and more
Potential Business Impact:
Teaches AI to work together and solve harder problems.
Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role-based orchestration, while RL uses environmental rewards to learn stronger policies, such as GRPO-style optimization. However, applying on-policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, standard GRPO grouping assumptions break down because prompts vary by role and by turn. System-wise, the training stack must support MAS-workflow rollouts and on-policy updates for both single-policy and multi-policy models. We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single- and multi-policy regimes. Across game, planning, coding, and math tasks, AT-GRPO delivers substantial gains. On long-horizon planning, it raises accuracy from 14.0–47.0 percent (single-agent RL baselines) to 96.0–99.5 percent. It also improves reasoning performance, with average gains of 3.87–7.62 percent on coding tasks and 9.0–17.93 percent on math. Code and environments are available at: https://github.com/pettingllms-ai/PettingLLMs.
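The core algorithmic idea in the abstract — that standard GRPO grouping breaks when prompts vary by role and turn, so rollouts must instead be grouped per agent and per turn before normalizing advantages — can be sketched as follows. This is a minimal illustration only; the function and field names (`at_grpo_advantages`, `agent`, `turn`, `reward`) are assumptions for exposition, not the paper's actual implementation.

```python
from collections import defaultdict
from statistics import mean, pstdev

def at_grpo_advantages(rollouts):
    """Group rewards by (agent_role, turn) and normalize within each group.

    Each rollout is a dict: {"agent": str, "turn": int, "reward": float}.
    Standard GRPO normalizes over rollouts that share a single prompt; in a
    multi-agent workflow the prompt differs by role and by turn, so groups
    are formed per (agent, turn) pair instead (illustrative sketch).
    """
    groups = defaultdict(list)
    for i, r in enumerate(rollouts):
        groups[(r["agent"], r["turn"])].append(i)

    advantages = [0.0] * len(rollouts)
    for idxs in groups.values():
        rewards = [rollouts[i]["reward"] for i in idxs]
        mu, sigma = mean(rewards), pstdev(rewards)
        for i in idxs:
            # Zero advantage when the group has no reward variance.
            advantages[i] = (rollouts[i]["reward"] - mu) / sigma if sigma > 0 else 0.0
    return advantages

# Two roles, two turns: advantages are normalized within each (agent, turn) group.
rollouts = [
    {"agent": "planner", "turn": 0, "reward": 1.0},
    {"agent": "planner", "turn": 0, "reward": 0.0},
    {"agent": "coder",   "turn": 1, "reward": 0.5},
    {"agent": "coder",   "turn": 1, "reward": 0.5},
]
advs = at_grpo_advantages(rollouts)
```

The grouping key is the design choice of interest: normalizing the planner's turn-0 rewards only against other planner turn-0 rollouts keeps the GRPO baseline comparison meaningful even though each role sees a different prompt.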
Similar Papers
Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization
Machine Learning (CS)
Teaches AI to solve math problems step-by-step.
LLM Collaboration With Multi-Agent Reinforcement Learning
Artificial Intelligence
Helps AI agents work together to write and code.