Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
By: Xiaoying Zhang, Hao Sun, Yipeng Zhang, and more
Potential Business Impact:
Helps computers learn better from mistakes and feedback.
Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL that relies solely on numerical feedback: performance plateaus, limited effectiveness of spontaneous self-reflection, and persistent failures. We then demonstrate that RL-fine-tuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks. Specifically, Critique-GRPO improves the average pass@1 score over all compared methods by approximately 4.4% on Qwen2.5-7B-Base and 3.8% on Qwen3-8B. Notably, Critique-GRPO enables effective self-improvement through self-critiquing, achieving significant gains over GRPO, e.g., a +16.7% pass@1 improvement on AIME 2024.
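The abstract describes the update only at a high level. As a rough illustration, here is a minimal Python sketch of how such an objective could look, assuming a standard GRPO group-normalized advantage and a REINFORCE-style sequence-level loss. The function names, the `alpha`/`beta` hyperparameters, and the use of sequence log-probability as an "unfamiliarity" proxy are illustrative assumptions, not the paper's actual formulation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # Group-relative advantage: z-score each reward within the group
    # of responses sampled for the same prompt (standard GRPO).
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def shaping_weight(is_correct: bool, seq_logprob: float,
                   alpha: float = 1.0, beta: float = 1.0) -> float:
    # Hypothetical shaping function: amplify correct refinements,
    # more strongly when the policy finds them unfamiliar (low
    # sequence log-probability), and penalize incorrect ones.
    # `alpha`/`beta` and the unfamiliarity proxy are assumptions.
    if is_correct:
        return 1.0 + alpha * (1.0 - float(torch.tensor(seq_logprob).exp()))
    return -beta

def critique_grpo_loss(seq_logprobs: torch.Tensor,
                       advantages: torch.Tensor,
                       weights: torch.Tensor) -> torch.Tensor:
    # One policy-gradient term per sequence; the batch mixes the
    # policy's initial responses (weight 1.0) with critique-guided
    # refinements (shaped weights), so both are learned from jointly.
    return -(seq_logprobs * advantages * weights).mean()

# Toy usage: four initial responses plus two critique-guided refinements.
rewards = torch.tensor([0.0, 0.0, 1.0, 0.0, 1.0, 0.0])
adv = grpo_advantages(rewards)
weights = torch.tensor([1.0, 1.0, 1.0, 1.0,
                        shaping_weight(True, seq_logprob=-42.0),
                        shaping_weight(False, seq_logprob=-30.0)])
seq_logprobs = torch.randn(6, requires_grad=True)  # stand-in for policy log-probs
loss = critique_grpo_loss(seq_logprobs, adv, weights)
loss.backward()
```

In this sketch, gradients flow only through the policy log-probabilities while advantages and weights act as fixed coefficients, which is what lets a correct but unfamiliar refinement receive a larger effective learning signal than a response the policy would already produce.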
Similar Papers
LANPO: Bootstrapping Language and Numerical Feedback for Reinforcement Learning in LLMs
Machine Learning (CS)
Helps AI learn math faster from past mistakes.
Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning
Computation and Language
Teaches AI to judge and fix its own answers.
Lessons from Training Grounded LLMs with Verifiable Rewards
Computation and Language
Makes AI answers more truthful and verifiable.