JustRL: Scaling a 1.5B LLM with a Simple RL Recipe
By: Bingxiang He, Zekai Qu, Zeyuan Liu, and more
Potential Business Impact:
Makes smart computer programs learn better with less effort.
Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: Is this complexity necessary? We present JustRL, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.
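The abstract's core claim is structural: one training stage, one fixed set of hyperparameters, and no schedules, curriculum switches, or length penalties over 4,000+ steps. The Python sketch below illustrates only that shape. The hyperparameter values, the stub functions (sample_responses, verify_answer, policy_gradient_update), and the binary-verifier reward are illustrative assumptions for this page, not values or APIs taken from the paper or its released code.

```python
from dataclasses import dataclass
import random

@dataclass(frozen=True)
class FixedHyperparams:
    # Illustrative values only; the paper's actual settings are not reproduced here.
    learning_rate: float = 1e-6
    rollouts_per_prompt: int = 8
    batch_size: int = 256
    total_steps: int = 4000  # abstract reports smooth improvement over 4,000+ steps

def sample_responses(prompt: str, n: int) -> list[str]:
    """Stub: stand-in for sampling n completions from the 1.5B policy model."""
    return [f"<candidate solution {i} for: {prompt}>" for i in range(n)]

def verify_answer(response: str, reference: str) -> float:
    """Stub: binary verifier reward (1.0 if the final answer checks out, else 0.0)."""
    return float(random.random() < 0.5)

def policy_gradient_update(batch: list[tuple[str, str, float]], lr: float) -> None:
    """Stub: stand-in for one policy-gradient step on (prompt, response, reward) triples."""
    pass

def train(problems: list[tuple[str, str]], hp: FixedHyperparams) -> None:
    # Single stage: the same hyperparameters apply from the first step to the last.
    # No curriculum switches, no learning-rate or reward-shaping schedules,
    # and no explicit length penalty added to the verifier reward.
    for step in range(hp.total_steps):
        batch: list[tuple[str, str, float]] = []
        sampled = random.sample(problems, k=min(hp.batch_size, len(problems)))
        for prompt, reference in sampled:
            responses = sample_responses(prompt, hp.rollouts_per_prompt)
            rewards = [verify_answer(r, reference) for r in responses]
            batch.extend((prompt, r, rw) for r, rw in zip(responses, rewards))
        policy_gradient_update(batch, hp.learning_rate)

if __name__ == "__main__":
    # Tiny toy run to show the loop executes end to end.
    train([("2+2=?", "4"), ("3*7=?", "21")], FixedHyperparams(total_steps=2))
```

The point of the sketch is what is absent: there is no second stage, no hyperparameter schedule, and no auxiliary penalty term, which is the simplicity the abstract argues is sufficient once the baseline is stable and scaled up.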
Similar Papers
The Art of Scaling Reinforcement Learning Compute for LLMs
Machine Learning (CS)
Helps AI learn better and faster.
A Technical Study into 0.5B Reasoning Language Models
Artificial Intelligence
Makes small AI smart enough for hard problems.
A Survey of Reinforcement Learning for Large Reasoning Models
Computation and Language
Teaches computers to think and solve hard problems.