Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning

Published: March 20, 2025 | arXiv ID: 2503.15952v2

By: Chen Li, Nazhou Liu, Kai Yang

Potential Business Impact:

Enables more stable reinforcement-learning training of reasoning LLMs and shorter chains of thought, reducing inference token costs.

Business Areas:
A/B Testing, Data and Analytics

Since DeepSeek-R1 popularized it, Group Relative Policy Optimization (GRPO) has become a core component of training reasoning LLMs. However, we find deficiencies that affect RL stability and inference efficiency, such as zero variance in advantage estimation. We therefore propose Adaptive Group Policy Optimization (AGPO), which contains a simple but effective modification: a revised objective function that mitigates training fluctuation and the zero-advantage problem. Experiments demonstrate that our method achieves more stable training and superior performance while using significantly fewer tokens in its reasoning steps.
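To make the zero-variance issue concrete, here is a minimal PyTorch sketch of group-normalized advantage estimation and one hypothetical mitigation. The `adaptive_advantages` rule, its `global_baseline` argument, and the `std_floor` guard are illustrative assumptions for exposition, not the revised AGPO objective from the paper.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Standard GRPO-style advantage: normalize each sampled response's reward
    # against its group. If every response in the group earns the same reward
    # (all correct or all wrong), rewards - mean is identically zero, so the
    # whole group yields zero advantage and contributes no gradient signal.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def adaptive_advantages(rewards: torch.Tensor,
                        global_baseline: float,
                        std_floor: float = 1e-6) -> torch.Tensor:
    # Hypothetical mitigation (an assumption, not the paper's method): when a
    # group is degenerate (near-zero variance), center its rewards against a
    # running global baseline instead of the group mean, so uniformly-correct
    # or uniformly-wrong groups still carry a nonzero learning signal.
    std = rewards.std()
    if std < std_floor:
        return rewards - global_baseline
    return (rewards - rewards.mean()) / std

if __name__ == "__main__":
    uniform = torch.ones(8)  # all 8 sampled responses answered correctly
    print(grpo_advantages(uniform))                            # all zeros: no signal
    print(adaptive_advantages(uniform, global_baseline=0.4))   # nonzero signal
```

The sketch only illustrates the failure mode the abstract names; the paper's actual revised objective should be taken from the source itself.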

Page Count
5 pages

Category
Computer Science:
Computation and Language