Score: 3

Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Published: October 29, 2025 | arXiv ID: 2510.25992v1

By: Yihe Deng, I-Hung Hsu, Jun Yan and more

BigTech Affiliations: Google

Potential Business Impact:

Teaches computers to solve hard problems step-by-step.

Business Areas:

Machine Learning, Artificial Intelligence, Data and Analytics, Software

Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
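The step-wise reward described in the abstract can be illustrated with a small sketch. The abstract does not name the similarity metric, so this example assumes a character-level sequence-matching ratio from Python's standard `difflib` module; the function names `action_similarity` and `stepwise_rewards` are hypothetical and not taken from the paper. It is only meant to show how a similarity-based reward can stay informative when a rollout does not match the expert exactly.

```python
from difflib import SequenceMatcher


def action_similarity(model_action: str, expert_action: str) -> float:
    """Return a similarity score in [0, 1] between two action strings.

    Assumption: difflib's matching ratio stands in for whatever
    similarity measure the authors actually use.
    """
    return SequenceMatcher(None, model_action, expert_action).ratio()


def stepwise_rewards(model_actions, expert_actions):
    """Score each model action against the expert action at the same step."""
    return [
        action_similarity(model, expert)
        for model, expert in zip(model_actions, expert_actions)
    ]


if __name__ == "__main__":
    # Hypothetical two-step trajectory; the second model action is wrong.
    expert = ["x = 2 + 3", "y = x * 4"]
    model = ["x = 2 + 3", "y = x * 5"]
    for step, reward in enumerate(stepwise_rewards(model, expert), start=1):
        print(f"step {step}: reward = {reward:.3f}")
```

Under a binary final-answer check, the trajectory above would simply count as incorrect. With a per-step similarity score, the first step receives full credit and the second receives partial credit, which is the kind of denser signal the abstract attributes to SRL.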

SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning

Artificial Intelligence

Teaches computers to learn better from examples.

1 Jun 2025 1

93%

Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning

Machine Learning (CS)

Makes AI better at thinking, even small ones.

14 Dec 2025 0

92%

When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents

Computation and Language

Teaches computers to think and use tools better.

12 Dec 2025 0


Country of Origin

🇺🇸 United States

Repos / Data Links

github.com huggingface.co huggingface.co

Page Count

18 pages
