Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following
By: Qingyu Ren, Qianyu He, Bowei Zhang, and more
Potential Business Impact:
Trains smart AIs to obey better without losing cleverness
Reasoning models excel at complex problem solving but exhibit a concerning trade-off between reasoning capability and instruction following. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased cost and restricted accessibility. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction following while maintaining reasoning performance, offering a scalable and cost-effective way to enhance instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.
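To make the core idea concrete, the sketch below shows one way a model's own judgment could replace a stronger external judge as the reward signal during RL. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the names `self_supervised_reward` and `collect_rl_batch`, the YES/NO judging prompt, and the binary reward mapping are all hypothetical choices made here for clarity.

```python
# Minimal sketch of self-supervised reward for instruction following.
# Assumption: `model` is the reasoning policy itself, exposed as a
# prompt -> text callable. No external, stronger judge model is used.
from typing import Callable, List, Tuple

def self_supervised_reward(
    model: Callable[[str], str],
    instruction: str,
    response: str,
) -> float:
    """Ask the same model to judge whether its response satisfies the
    instruction, and map the verdict to a scalar reward (1.0 or 0.0)."""
    judge_prompt = (
        f"Instruction:\n{instruction}\n\n"
        f"Response:\n{response}\n\n"
        "Does the response satisfy every constraint in the instruction? "
        "Answer YES or NO."
    )
    verdict = model(judge_prompt).strip().upper()
    return 1.0 if verdict.startswith("YES") else 0.0

def collect_rl_batch(
    model: Callable[[str], str],
    instructions: List[str],
) -> List[Tuple[str, str, float]]:
    """Roll out responses and score them with the model's own verdicts.
    The resulting (prompt, response, reward) triples would feed a
    standard policy-gradient update such as PPO or GRPO (not shown)."""
    batch = []
    for inst in instructions:
        response = model(inst)  # policy rollout on the instruction
        reward = self_supervised_reward(model, inst, response)
        batch.append((inst, response, reward))
    return batch
```

Because both the rollout and the reward come from the same model, this loop needs no external supervision, which is the cost and accessibility advantage the abstract emphasizes; the trade-off is that reward quality is bounded by the model's own self-evaluation ability.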
Similar Papers
Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following
Computation and Language
Teaches computers to follow complex orders perfectly.
Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models
CV and Pattern Recognition
Teaches computers to follow tricky directions better.
ReasonIF: Large Reasoning Models Fail to Follow Instructions During Reasoning
Machine Learning (CS)
Makes AI follow instructions while thinking.