R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge
By: Yeonjun In, Wonjoong Kim, Sangwu Park, and more
Potential Business Impact:
Teaches AI models to use the safety knowledge they already have.
Although large reasoning models (LRMs) have demonstrated impressive capabilities on complex tasks, recent studies reveal that these models frequently fulfill harmful user instructions, raising significant safety concerns. In this paper, we investigate the underlying cause of LRM safety risks and find that models already possess sufficient safety knowledge but fail to activate it during reasoning. Based on this insight, we propose R1-Act, a simple and efficient post-training method that explicitly triggers safety knowledge through a structured reasoning process. R1-Act achieves strong safety improvements while preserving reasoning performance, outperforming prior alignment methods. Notably, it requires only 1,000 training examples and 90 minutes of training on a single RTX A6000 GPU. Extensive experiments across multiple LRM backbones and sizes demonstrate the robustness, scalability, and practical efficiency of our approach.
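The abstract describes the recipe only at a high level (post-training on roughly 1,000 structured reasoning examples in about 90 minutes on one GPU), so the sketch below is a hedged guess at what such a run could look like: plain supervised fine-tuning where each training target explicitly states a safety assessment inside the reasoning trace before the answer. The backbone name, the <think> template, and the data fields are illustrative assumptions, not the authors' released code or dataset.

# A minimal, hypothetical sketch of lightweight safety post-training:
# supervised fine-tuning on a handful of examples whose targets prepend an
# explicit "assess harmfulness first" step to the model's reasoning trace.
# The model name, template, and example data are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed LRM backbone
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).to(device)

# Hypothetical structured-reasoning target: the model first states a safety
# assessment inside its thinking tags, then gives a refusal or an answer.
TEMPLATE = (
    "<think>Before answering, I check whether the request could cause harm. "
    "{assessment}</think>\n{response}"
)

# Toy stand-ins for the ~1,000 training pairs mentioned in the abstract.
examples = [
    {
        "instruction": "Explain how to pick a neighbor's door lock.",
        "assessment": "This would enable unauthorized entry, so it is harmful.",
        "response": "I can't help with that, but I can explain how lock security ratings work.",
    },
    {
        "instruction": "Summarize why exercise improves sleep.",
        "assessment": "This request is benign.",
        "response": "Regular exercise deepens slow-wave sleep and stabilizes circadian rhythm.",
    },
]

def encode(ex):
    text = ex["instruction"] + "\n" + TEMPLATE.format(
        assessment=ex["assessment"], response=ex["response"]
    )
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    enc["labels"] = enc["input_ids"].clone()  # causal-LM SFT loss on the full sequence
    return enc

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(2):
    for ex in examples:
        batch = {k: v.to(device) for k, v in encode(ex).items()}
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

In practice one would mask the prompt tokens from the loss and train on the full example set; this toy loop only illustrates the shape of the procedure the abstract alludes to.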
Similar Papers
Risk-adaptive Activation Steering for Safe Multimodal Large Language Models
Computer Vision and Pattern Recognition
AI learns to spot bad pictures and be helpful.
Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability
Computation and Language
Makes smart computers think safely and correctly.
The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1
Computers and Society
Makes smart AI safer to use.