SaRO: Enhancing LLM Safety through Reasoning-based Alignment
By: Yutao Mou, Yuxiao Luo, Shikun Zhang, and more
Potential Business Impact:
Makes AI assistants harder to jailbreak while refusing fewer harmless requests.
Current safety alignment techniques for large language models (LLMs) face two key challenges: (1) under-generalization, which leaves models vulnerable to novel jailbreak attacks, and (2) over-alignment, which leads to the excessive refusal of benign instructions. Our preliminary investigation reveals semantic overlap between jailbreak/harmful queries and normal prompts in embedding space, suggesting that more effective safety alignment requires a deeper semantic understanding. This motivates us to incorporate safety-policy-driven reasoning into the alignment process. To this end, we propose the Safety-oriented Reasoning Optimization Framework (SaRO), which consists of two stages: (1) Reasoning-style Warmup (RW) that enables LLMs to internalize long-chain reasoning through supervised fine-tuning, and (2) Safety-oriented Reasoning Process Optimization (SRPO) that promotes safety reflection via direct preference optimization (DPO). Extensive experiments demonstrate the superiority of SaRO over traditional alignment methods.
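The abstract describes the training objectives only at a high level (supervised fine-tuning for the warmup stage, DPO for SRPO). As a rough illustration of the SRPO stage, the sketch below implements the standard DPO preference loss in PyTorch, where the "chosen" completion would be a safety-reflective reasoning trajectory and the "rejected" one a trajectory without that reflection. The helper names, shapes, and the beta value are assumptions for illustration, not the authors' code.

# Minimal sketch of the DPO preference loss that SaRO's SRPO stage builds on.
# Written from the standard DPO formulation; `sequence_logprob`, `dpo_loss`,
# and beta=0.1 are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F


def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities for each sequence in the batch.

    Assumes labels are already shifted so logits[:, t] predicts labels[:, t],
    and that prompt/padding positions are marked with -100 to be ignored.
    logits: (batch, seq_len, vocab); labels: (batch, seq_len).
    """
    logprobs = F.log_softmax(logits, dim=-1)
    mask = labels != -100
    safe_labels = labels.clamp(min=0)
    token_logprobs = logprobs.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)
    return (token_logprobs * mask).sum(dim=-1)


def dpo_loss(
    policy_chosen_logp: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logp: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logp: torch.Tensor,       # log pi_ref(y_w | x), frozen reference
    ref_rejected_logp: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective: prefer the safety-reflective trajectory y_w
    over the unsafe or shallow trajectory y_l, regularized toward pi_ref."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

In a setup like SaRO's, the reference log-probabilities would typically come from the frozen warmup (RW) checkpoint, the usual choice of reference model in DPO, so the preference optimization reshapes the reasoning process rather than the base behavior learned during warmup.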
Similar Papers
ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning
Cryptography and Security
Makes AI understand bad requests better.
ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization
Computation and Language
Makes AI safer by teaching it to think first.
LoRA is All You Need for Safety Alignment of Reasoning LLMs
Artificial Intelligence
Keeps AI safe without losing its reasoning skills.