Group Sequence Policy Optimization
By: Chujie Zheng, Shixuan Liu, Mingze Li, and more
Potential Business Impact:
Makes large language models learn faster and more stably during reinforcement learning training.
This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
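The abstract describes the key mechanical change: importance ratios, clipping, and rewards are applied per sequence rather than per token. The sketch below illustrates what such a sequence-level surrogate loss could look like in PyTorch. It is a minimal sketch, not the paper's reference implementation: it assumes a length-normalized sequence likelihood ratio and GRPO-style group-normalized advantages, and the tensor shapes, function name `gspo_loss`, and clip range `eps` are illustrative.

```python
# Minimal sketch of a sequence-level (GSPO-style) clipped surrogate loss.
# Assumptions beyond the abstract: the sequence importance ratio is the
# length-normalized likelihood ratio, and advantages are group-normalized
# rewards, one scalar per sampled sequence.
import torch

def gspo_loss(logp_new, logp_old, rewards, mask, eps=0.2):
    """
    logp_new, logp_old: (G, T) per-token log-probs under the current / old policy
    rewards:            (G,)   scalar reward per sampled sequence in the group
    mask:               (G, T) 1 for response tokens, 0 for padding
    """
    lengths = mask.sum(dim=-1).clamp(min=1)

    # Sequence-level importance ratio (assumed length-normalized):
    # s_i = exp( (1/|y_i|) * sum_t [log pi_new(y_i,t) - log pi_old(y_i,t)] )
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    s = torch.exp(log_ratio)

    # Group-normalized advantage, one scalar shared by all tokens of a sequence.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Sequence-level clipping: the whole response is kept or clipped as a unit,
    # in contrast to token-level ratios where each token is clipped separately.
    unclipped = s * adv
    clipped = torch.clamp(s, 1.0 - eps, 1.0 + eps) * adv
    return -torch.min(unclipped, clipped).mean()
```

Because the ratio and the clipping act on the whole response, no single token's likelihood shift can dominate the update, which is one plausible reading of why the abstract highlights improved stability, particularly for Mixture-of-Experts models.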
Similar Papers
Agent-GSPO: Communication-Efficient Multi-Agent Systems via Group Sequence Policy Optimization
Multiagent Systems
Makes AI agents talk less, saving money.
SSPO: Subsentence-level Policy Optimization
Computation and Language
Makes AI smarter and better at learning from its mistakes.
ESPO: Entropy Importance Sampling Policy Optimization
Machine Learning (CS)
Makes AI better at solving math problems.