
M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization

Published: December 15, 2025 | arXiv ID: 2512.13070v1

By: Bizhe Bai, Hongming Wu, Peng Ye and more

Potential Business Impact:

Makes AI smarter and lets it train longer without its performance collapsing.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs) without reliance on expensive human-annotated data. However, we find that existing methods suffer from a critical failure mode under long-horizon training: a "policy collapse" where performance precipitously degrades. We diagnose this instability and demonstrate that simply scaling the number of rollouts -- a common strategy to improve performance -- only delays, but does not prevent, this collapse. To counteract this instability, we first introduce M-GRPO (Momentum-Anchored Group Relative Policy Optimization), a framework that leverages a slowly evolving momentum model to provide a stable training target. In addition, we identify that this process is often accompanied by a rapid collapse in policy entropy, resulting in a prematurely confident and suboptimal policy. To specifically address this issue, we propose a second contribution: an adaptive filtering method based on the interquartile range (IQR) that dynamically prunes low-entropy trajectories, preserving essential policy diversity. Our extensive experiments on multiple reasoning benchmarks demonstrate that M-GRPO stabilizes the training process while the IQR filter prevents premature convergence. The combination of these two innovations leads to superior training stability and state-of-the-art performance.
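
The two ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the update rule or the filtering threshold, so the sketch assumes the momentum model is an exponential moving average (EMA) of the policy weights and that "low-entropy" means falling below the lower Tukey fence, Q1 - k * IQR. The names `ema_update`, `quantile`, `filter_low_entropy`, `decay`, and `k` are hypothetical, and weights are plain floats rather than tensors.

```python
from typing import Dict, List, Sequence


def ema_update(
    momentum: Dict[str, float],
    policy: Dict[str, float],
    decay: float = 0.99,
) -> Dict[str, float]:
    """Move the momentum weights a small step toward the policy weights.

    Assumption: the "slowly evolving momentum model" is an EMA of the policy.
    A decay close to 1.0 makes the momentum model change slowly.
    """
    return {
        name: decay * momentum[name] + (1.0 - decay) * policy[name]
        for name in momentum
    }


def quantile(sorted_values: Sequence[float], q: float) -> float:
    """Linearly interpolated quantile of an already sorted, non-empty sequence."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def filter_low_entropy(entropies: Sequence[float], k: float = 1.5) -> List[int]:
    """Return the indices of trajectories that are not low-entropy outliers.

    Assumption: a trajectory is pruned when its entropy falls below
    Q1 - k * IQR, where IQR = Q3 - Q1 is computed over the current batch,
    so the threshold adapts to each batch.
    """
    ordered = sorted(entropies)
    q1 = quantile(ordered, 0.25)
    q3 = quantile(ordered, 0.75)
    threshold = q1 - k * (q3 - q1)
    return [i for i, h in enumerate(entropies) if h >= threshold]


# Example: the last trajectory has far lower entropy than the rest of the batch.
batch_entropies = [2.0, 2.1, 1.9, 2.2, 2.0, 0.1]
kept = filter_low_entropy(batch_entropies)

# Example: one EMA step with a deliberately small decay to make the shift visible.
updated = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
```

In this example the batch quartiles are Q1 = 1.925 and Q3 = 2.075, so the threshold is 1.7 and only the trajectory with entropy 0.1 is dropped. Because the threshold is recomputed per batch, the filter tightens or loosens with the spread of entropies instead of relying on a fixed cutoff.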

Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

Machine Learning (CS)

Makes AI give the same answers every time.

14 Dec 2025

91%

Training-Free Group Relative Policy Optimization

Computation and Language

Teaches computers to solve new problems better.

9 Oct 2025

91%

Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning

Computation and Language

Teaches computers to solve harder math problems.

22 Oct 2025


Country of Origin

🇨🇳 China

Page Count

10 pages
