RLSR: Reinforcement Learning from Self Reward
By: Toby Simonds, Kevin Lopez, Akira Yoshiyama, and more
Potential Business Impact:
AI learns to solve problems by checking its own work.
Large language models can generate solutions to complex problems, but training them with reinforcement learning typically requires verifiable rewards, which are expensive to create and unavailable in many domains. We demonstrate that LLMs can effectively self-improve through self-judging without reference solutions, leveraging the inherent asymmetry between generating and verifying solutions. Our experiments show that models can provide reliable reward signals without ground truth answers, enabling reinforcement learning in domains where verifiable rewards are impractical. By implementing self-judging across Countdown puzzles and integration problems, we achieve performance comparable to formal verification without ground truth solutions. Most notably, Qwen 2.5 7B DeepSeek Distilled trained with self-rewards reaches qualifying-level performance for the prestigious MIT Integration Bee competition through self-supervised improvement. When combined with synthetic question generation, we establish a complete self-improvement loop in which models generate practice problems, solve them, and evaluate their own performance without any external validation. Our findings demonstrate that LLM judges can provide effective reward signals for training, unlocking reinforcement learning in countless domains previously limited by reward engineering challenges. This work represents a significant step toward autonomous AI systems that continuously improve through self-directed learning rather than human-guided training, potentially accelerating progress across domains where training data is scarce or evaluation is complex.
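To make the self-improvement loop concrete, here is a minimal sketch of one RLSR-style iteration: generate a problem, solve it, score the solution with the model itself, and pass the scalar reward to an RL update. All names here (`generate_problem`, `solve`, `self_judge`, `rlsr_step`, `policy_update`) are hypothetical placeholders, not the authors' code; `model` stands in for any LLM callable mapping a prompt to a completion.

```python
# Hypothetical sketch of the RLSR self-improvement loop described above.
# Not the authors' implementation; prompts and function names are illustrative.

from typing import Callable

Model = Callable[[str], str]  # prompt -> completion


def generate_problem(model: Model) -> str:
    """Synthetic question generation: the model writes its own practice problem."""
    return model("Write a definite integral problem of moderate difficulty.")


def solve(model: Model, problem: str) -> str:
    """The model produces a candidate solution to its own problem."""
    return model(f"Solve the following problem step by step:\n{problem}")


def self_judge(model: Model, problem: str, solution: str) -> float:
    """Self-reward: score the solution with the same model, exploiting the
    generate/verify asymmetry. No reference answer is consulted."""
    verdict = model(
        "You are a strict grader. Reply CORRECT or INCORRECT.\n"
        f"Problem: {problem}\nProposed solution: {solution}"
    )
    return 1.0 if "CORRECT" in verdict.upper() else 0.0


def rlsr_step(model: Model, policy_update: Callable[[str, str, float], None]) -> float:
    """One full iteration: generate, solve, self-score, then feed the scalar
    reward into any standard RL update (e.g., a PPO- or GRPO-style step)."""
    problem = generate_problem(model)
    solution = solve(model, problem)
    reward = self_judge(model, problem, solution)
    policy_update(problem, solution, reward)  # hypothetical RL training hook
    return reward
```

In practice the judge prompt and the policy update would be more elaborate (e.g., sampling multiple solutions per problem and rewarding relative to the batch), but the key point the abstract makes survives in this skeleton: no ground-truth answer appears anywhere in the reward path.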
Similar Papers
Can Large Reasoning Models Self-Train?
Machine Learning (CS)
Teaches computers math without needing answers.
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
Computation and Language
Teaches computers to learn from their mistakes.
Process-based Self-Rewarding Language Models
Computation and Language
Teaches computers to solve math problems better.