Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
By: Xiaoying Zhang, Hao Sun, Yipeng Zhang, and more
Potential Business Impact:
Helps computers learn better from mistakes and feedback.
Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL that relies solely on numerical feedback: performance plateaus, limited effectiveness of spontaneous self-reflection, and persistent failures. We then demonstrate that RL-fine-tuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks. Specifically, Critique-GRPO improves the average pass@1 score over all compared methods by approximately 4.4% on Qwen2.5-7B-Base and 3.8% on Qwen3-8B. Notably, Critique-GRPO enables effective self-improvement through self-critiquing, achieving significant gains over GRPO, e.g., a +16.7% pass@1 improvement on AIME 2024.
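The abstract describes the update only at a high level. As a rough illustration, here is a minimal Python sketch of how such an objective could look, assuming a standard GRPO group-normalized advantage and a REINFORCE-style sequence-level loss. The function names, the `alpha`/`beta` hyperparameters, and the use of sequence log-probability as an "unfamiliarity" proxy are illustrative assumptions, not the paper's actual formulation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # Group-relative advantage: z-score each reward within the group
    # of responses sampled for the same prompt (standard GRPO).
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def shaping_weight(is_correct: bool, seq_logprob: float,
                   alpha: float = 1.0, beta: float = 1.0) -> float:
    # Hypothetical shaping function: amplify correct refinements,
    # more strongly when the policy finds them unfamiliar (low
    # sequence log-probability), and penalize incorrect ones.
    # `alpha`/`beta` and the unfamiliarity proxy are assumptions.
    if is_correct:
        return 1.0 + alpha * (1.0 - float(torch.tensor(seq_logprob).exp()))
    return -beta

def critique_grpo_loss(seq_logprobs: torch.Tensor,
                       advantages: torch.Tensor,
                       weights: torch.Tensor) -> torch.Tensor:
    # One policy-gradient term per sequence; the batch mixes the
    # policy's initial responses (weight 1.0) with critique-guided
    # refinements (shaped weights), so both are learned from jointly.
    return -(seq_logprobs * advantages * weights).mean()

# Toy usage: four initial responses plus two critique-guided refinements.
rewards = torch.tensor([0.0, 0.0, 1.0, 0.0, 1.0, 0.0])
adv = grpo_advantages(rewards)
weights = torch.tensor([1.0, 1.0, 1.0, 1.0,
                        shaping_weight(True, seq_logprob=-42.0),
                        shaping_weight(False, seq_logprob=-30.0)])
seq_logprobs = torch.randn(6, requires_grad=True)  # stand-in for policy log-probs
loss = critique_grpo_loss(seq_logprobs, adv, weights)
loss.backward()
```

In this sketch, gradients flow only through the policy log-probabilities while advantages and weights act as fixed coefficients, which is what lets a correct but unfamiliar refinement receive a larger effective learning signal than a response the policy would already produce.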
Similar Papers
LANPO: Bootstrapping Language and Numerical Feedback for Reinforcement Learning in LLMs
Machine Learning (CS)
Helps AI learn math faster from past mistakes.
Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning
Computation and Language
Teaches AI to judge and fix its own answers.
Lessons from Training Grounded LLMs with Verifiable Rewards
Computation and Language
Makes AI answers more truthful and verifiable.