JudgeRLVR: Judge First, Generate Second for Efficient Reasoning
By: Jiangshan Duo, Hanyu Li, Hailin Zhang, and more
Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints such as length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and quality. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions from invalid ones, a model can internalize a guidance signal that prunes its search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge candidate solutions against verifiable answers. In the second stage, we fine-tune the same model with vanilla generative RLVR, initialized from the judge. Compared to vanilla RLVR trained on the same math-domain data, JudgeRLVR achieves a better quality-efficiency trade-off for Qwen3-30B-A3B: on in-domain math benchmarks, it delivers about +3.7 points of average accuracy with a 42% reduction in average generation length; on out-of-domain benchmarks, it delivers about +4.5 points of average accuracy improvement, demonstrating stronger generalization.
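To make the two-stage structure concrete, below is a minimal Python sketch of the reward signals implied by the abstract: Stage 1 rewards the model for correctly judging whether a candidate solution is valid (checked against a verifiable label), and Stage 2 is ordinary generative RLVR that rewards matching the verifiable final answer. All names here (`Example`, `stage1_judge_reward`, `stage2_generation_reward`) are illustrative assumptions, not the authors' implementation, and the exact-match check stands in for whatever verifier the paper uses.

```python
from dataclasses import dataclass


@dataclass
class Example:
    problem: str                 # math problem statement
    candidate_solution: str      # a solution response to be judged (Stage 1)
    candidate_is_correct: bool   # ground-truth label from a verifiable checker
    reference_answer: str        # verifiable final answer (Stage 2)


def stage1_judge_reward(model_verdict: bool, example: Example) -> float:
    """Stage 1: the model is trained to *judge* solutions.

    The rollout asks the model whether `candidate_solution` is valid; the
    verifiable reward is 1 if its verdict matches the ground-truth label.
    """
    return 1.0 if model_verdict == example.candidate_is_correct else 0.0


def stage2_generation_reward(model_answer: str, example: Example) -> float:
    """Stage 2: vanilla generative RLVR, initialized from the Stage 1 judge.

    The reward is 1 if the model's final answer matches the verifiable
    reference answer (exact string match here, for simplicity).
    """
    return 1.0 if model_answer.strip() == example.reference_answer.strip() else 0.0


if __name__ == "__main__":
    ex = Example(
        problem="What is 12 * 7?",
        candidate_solution="12 * 7 = 84",
        candidate_is_correct=True,
        reference_answer="84",
    )
    print(stage1_judge_reward(model_verdict=True, example=ex))      # 1.0
    print(stage2_generation_reward(model_answer="84", example=ex))  # 1.0
```

In this reading, the two stages share the same policy model and the same verifiable-reward machinery; only the rollout task changes, from classifying given solutions to generating new ones.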