Replay Failures as Successes: Sample-Efficient Reinforcement Learning for Instruction Following
By: Kongcheng Zhang, Qi Yao, Shunyu Liu, and more
Potential Business Impact:
Teaches AI to learn from its mistakes.
Reinforcement Learning (RL) has shown promise for aligning Large Language Models (LLMs) to follow instructions with various constraints. Despite the encouraging results, RL improvement inevitably relies on sampling successful, high-quality responses; however, the initial model often struggles to generate responses that satisfy all constraints due to its limited capabilities, yielding sparse or indistinguishable rewards that impede learning. In this work, we propose Hindsight instruction Replay (HiR), a novel sample-efficient RL framework for complex instruction following tasks, which employs a select-then-rewrite strategy to replay failed attempts as successes based on the constraints that have been satisfied in hindsight. We perform RL on these replayed samples as well as the original ones, theoretically framing the objective as dual-preference learning at both the instruction- and response-level to enable efficient optimization using only a binary reward signal. Extensive experiments demonstrate that the proposed HiR yields promising results across different instruction following tasks, while requiring less computational budget. Our code and dataset are available at https://github.com/sastpg/HIR.
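To make the select-then-rewrite idea concrete, the Python sketch below illustrates one plausible reading of the hindsight replay step: a failed attempt is relabeled as a success on the subset of constraints its response happens to satisfy, and the rewritten (instruction, response) pair is given a binary reward of 1 so it can be trained on alongside the original samples. All names here (Constraint, Sample, hindsight_replay, rewrite_instruction) are hypothetical illustrations, not the authors' implementation.

    # Hypothetical sketch of hindsight instruction replay (not the authors' code).
    from dataclasses import dataclass
    from typing import Callable, List, Optional

    @dataclass
    class Constraint:
        name: str
        check: Callable[[str], bool]  # returns True if the response satisfies it

    @dataclass
    class Sample:
        instruction: str            # original instruction text
        constraints: List[Constraint]
        response: str
        reward: int                 # binary: 1 if all constraints satisfied, else 0

    def rewrite_instruction(instruction: str, kept: List[Constraint]) -> str:
        # Placeholder rewrite: in practice this would be a templated or
        # model-based rewrite that keeps only the satisfied constraints.
        names = ", ".join(c.name for c in kept)
        return f"{instruction} (only require: {names})"

    def hindsight_replay(sample: Sample) -> Optional[Sample]:
        """Replay a failed attempt as a success on the constraints it did satisfy."""
        satisfied = [c for c in sample.constraints if c.check(sample.response)]
        if sample.reward == 1 or not satisfied:
            return None  # nothing to replay: already a success, or nothing satisfied
        relaxed = rewrite_instruction(sample.instruction, satisfied)
        return Sample(relaxed, satisfied, sample.response, reward=1)

Under this reading, RL then optimizes over both the original samples (sparse binary reward) and the replayed ones (guaranteed positive reward), which is what lets the method learn from attempts that would otherwise contribute no signal.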
Similar Papers
Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following
Artificial Intelligence
Trains smart AIs to obey better without losing cleverness
Checklists Are Better Than Reward Models For Aligning Language Models
Computation and Language
Teaches computers to follow all kinds of instructions.
Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following
Computation and Language
Teaches computers to follow complex orders perfectly.