Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement
By: Zizhuo Zhang, Jianing Zhu, Xinmu Ge, and more
Potential Business Impact:
Teaches computers to think better by comparing answers.
Although reinforcement learning with verifiable rewards (RLVR) shows promise in improving the reasoning ability of large language models (LLMs), a scaling-up dilemma remains due to the reliance on human-annotated labels, especially for complex tasks. Recent alternatives that explore various self-reward signals show potential for eliciting LLM reasoning, but suffer from a non-negligible collapse issue. Inspired by the success of self-supervised learning, we propose \textit{Co-Reward}, a novel RL framework that leverages contrastive agreement across semantically analogical questions as a reward basis. Specifically, we construct a similar question for each training sample (without labels) and synthesize their individual surrogate labels through simple rollout voting; the reward is then constructed by cross-referencing the labels of each question pair to enforce internal reasoning consistency across analogical inputs. Intuitively, such a self-supervised reward-shaping mechanism makes it harder for learning to collapse into a trivial solution, and promotes stable reasoning elicitation and improvement by expanding the input sample variants. Empirically, Co-Reward achieves superior performance compared to other self-reward baselines on multiple reasoning benchmarks and LLM series, and matches or even surpasses the ground-truth (GT) labeled reward, with improvements of up to $+6.8\%$ on MATH500 over the GT reward on Llama-3.2-3B-Instruct. Our code is publicly available at https://github.com/tmlr-group/Co-Reward.
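To make the cross-referenced reward concrete, here is a minimal sketch (not the authors' released code) of the idea described in the abstract: each unlabeled question is paired with a rephrased variant, each question's surrogate label is the majority-vote answer over its own rollouts, and every rollout is rewarded by agreement with the *paired* question's surrogate label. The `sample_answers` callable, the `toy_policy`, and the question pair are hypothetical placeholders.

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Surrogate label: the most frequent final answer among rollouts."""
    return Counter(answers).most_common(1)[0][0]


def co_reward(
    sample_answers: Callable[[str, int], List[str]],  # policy rollouts -> final answers
    question: str,
    rephrased: str,
    n_rollouts: int = 8,
) -> Tuple[List[float], List[float]]:
    """Return per-rollout rewards for the original and rephrased question."""
    ans_q = sample_answers(question, n_rollouts)
    ans_r = sample_answers(rephrased, n_rollouts)

    label_q = majority_vote(ans_q)  # surrogate label from the original question
    label_r = majority_vote(ans_r)  # surrogate label from the rephrased variant

    # Cross-referenced rewards: rollouts on the original question are scored
    # against the rephrased question's label, and vice versa, which enforces
    # consistency across the analogical pair.
    rewards_q = [1.0 if a == label_r else 0.0 for a in ans_q]
    rewards_r = [1.0 if a == label_q else 0.0 for a in ans_r]
    return rewards_q, rewards_r


if __name__ == "__main__":
    import random

    # Toy policy: mostly answers "42" on both phrasings of the same question.
    def toy_policy(question: str, n: int) -> List[str]:
        return [random.choice(["42", "42", "42", "41"]) for _ in range(n)]

    rq, rr = co_reward(toy_policy, "What is 6 * 7?", "Compute six times seven.")
    print(rq, rr)
```

In an actual RLVR setup these per-rollout rewards would feed a policy-gradient update (e.g., GRPO or PPO); the sketch only illustrates how the reward signal is formed without ground-truth labels.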
Similar Papers
Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning
Artificial Intelligence
Teaches computers to think better without help.
Coupled Variational Reinforcement Learning for Language Model General Reasoning
Computation and Language
Makes AI think better to solve problems.
Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models
Computation and Language
Makes AI better at answering questions correctly.