Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety
By: Yuyou Zhang, Miao Li, William Han, and more
Potential Business Impact:
Teaches AI to think safely before answering.
Large Language Models (LLMs) are vulnerable to jailbreak attacks that exploit weaknesses in traditional safety alignment, which often relies on rigid refusal heuristics or representation engineering to block harmful outputs. While these methods are effective against direct adversarial attacks, they fall short on broader safety challenges that require nuanced, context-aware decision-making. To address this, we propose Reasoning-enhanced Fine-tuning for interpretable LLM Safety (Rational), a novel framework that trains models to engage in explicit safe reasoning before responding. Fine-tuned models draw on their extensive pretraining knowledge through self-generated, structured reasoning, bootstrapping their own safety and internalizing context-sensitive decision-making. Our findings suggest that safety extends beyond refusal and requires context awareness for more robust, interpretable, and adaptive responses. Reasoning is not only a core capability of LLMs but also a fundamental mechanism for LLM safety. Through reasoning-enhanced fine-tuning, Rational rejects harmful prompts while providing meaningful, context-aware responses in complex scenarios.
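To make the training recipe more concrete, below is a minimal sketch of how reasoning-enhanced fine-tuning data might be assembled: each prompt is paired with a self-generated safety-reasoning trace followed by the final response, and the concatenated sequence becomes the supervised target. The field names, the `<reasoning>`/`<response>` tags, and the example prompts are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch (assumptions, not the paper's exact format): build supervised
# fine-tuning examples where an explicit safety-reasoning trace precedes
# the final answer, so the model learns to reason about context before responding.

import json

def build_example(prompt: str, reasoning: str, response: str) -> dict:
    """Pair a user prompt with a safety-reasoning trace followed by the
    final answer, concatenated into one supervised target string."""
    target = (
        "<reasoning>\n"
        f"{reasoning.strip()}\n"
        "</reasoning>\n"
        "<response>\n"
        f"{response.strip()}\n"
        "</response>"
    )
    return {"prompt": prompt, "target": target}

examples = [
    build_example(
        prompt="How do I pick a lock?",
        reasoning=(
            "The request could be benign (locked out of one's own home) or "
            "harmful (unauthorized entry). Intent is ambiguous, so give safe, "
            "lawful guidance rather than a blanket refusal."
        ),
        response=(
            "If you're locked out of your own property, a licensed locksmith "
            "or your landlord/building manager is the safest option. I can't "
            "help with bypassing locks you don't own."
        ),
    ),
]

# Write JSONL that a standard supervised fine-tuning pipeline can consume.
with open("rational_sft_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

In this sketch, the reasoning trace is kept in the training target so the fine-tuned model produces its safety analysis explicitly, which is what makes the resulting behavior interpretable rather than a bare accept/refuse decision.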
Similar Papers
RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards
Artificial Intelligence
Keeps AI from saying bad things.
Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models
Computation and Language
Reasoning makes AI more easily tricked by bias.
Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability
Computation and Language
Makes smart computers think safely and correctly.