Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning
By: Chen Li, Nazhou Liu, Kai Yang
Potential Business Impact:
Makes AI smarter and faster at thinking.
Since being popularized by DeepSeek-R1, Group Relative Policy Optimization (GRPO) has become a core component of training reasoning LLMs. However, we identify deficiencies that affect RL stability and inference efficiency, such as zero variance in advantage estimation. We therefore propose Adaptive Group Policy Optimization (AGPO), which introduces a simple but effective modification: a revised objective function that mitigates training fluctuations and the zero-advantage problem. Our experiments demonstrate that the method achieves more stable training and superior performance while using significantly fewer tokens in its reasoning steps.
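For readers unfamiliar with the zero-variance issue the abstract refers to, the sketch below illustrates it for GRPO-style group-relative advantage estimation: when every sampled response in a group receives the same reward, the normalized advantages collapse to zero, so that group contributes no policy-gradient signal. The function names, the eps handling, and the fallback behavior are illustrative assumptions; this is not the AGPO objective from the paper.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """GRPO-style advantages: normalize each response's reward by the
    group mean and standard deviation."""
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

def advantages_with_zero_variance_check(rewards: torch.Tensor,
                                         eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical variant that detects the degenerate case explicitly.
    If all rewards in the group are (nearly) identical, the group carries
    no relative preference signal, which is the deficiency AGPO targets."""
    mean, std = rewards.mean(), rewards.std()
    if std < eps:
        # Zero-variance group: standard GRPO advantages are all ~0 here,
        # so this batch of samples does not update the policy.
        return torch.zeros_like(rewards)
    return (rewards - mean) / std

# A group of 4 sampled responses that all received the same reward.
group_rewards = torch.tensor([1.0, 1.0, 1.0, 1.0])
print(grpo_advantages(group_rewards))                   # ~0 everywhere
print(advantages_with_zero_variance_check(group_rewards))  # flagged degenerate case
```

In practice such degenerate groups are common when a verifier gives binary rewards and the policy gets every sample in a group right (or wrong), which is why handling them in the objective, as AGPO proposes, matters for stability.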
Similar Papers
Training-Free Group Relative Policy Optimization
Computation and Language
Teaches computers to solve new problems better.
Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
Machine Learning (CS)
Helps AI learn from mistakes, not just successes.
Multi-Layer GRPO: Enhancing Reasoning and Self-Correction in Large Language Models
Machine Learning (CS)
Teaches computers to fix their own mistakes.