ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization
By: Kehua Feng, Keyan Ding, Jing Yu, and more
Potential Business Impact:
Makes AI safer by teaching it to think first.
Recent advancements in large language models (LLMs) have accelerated progress toward artificial general intelligence, yet their potential to generate harmful content poses critical safety challenges. Existing alignment methods often struggle to cover diverse safety scenarios and remain vulnerable to adversarial attacks. In this work, we propose Ex-Ante Reasoning Preference Optimization (ERPO), a novel safety alignment framework that equips LLMs with explicit preemptive reasoning through Chain-of-Thought and provides clear evidence for safety judgments by embedding predefined safety rules. Specifically, our approach consists of three stages: first, equipping the model with Ex-Ante reasoning through supervised fine-tuning (SFT) using a constructed reasoning module; second, enhancing safety, usefulness, and efficiency via Direct Preference Optimization (DPO); and third, mitigating inference latency with a length-controlled iterative preference optimization strategy. Experiments on multiple open-source LLMs demonstrate that ERPO significantly enhances safety performance while maintaining response efficiency.
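As a rough illustration of the second stage, the sketch below shows the standard Direct Preference Optimization objective (Rafailov et al., 2023) that ERPO's preference-tuning step builds on. The function name dpo_loss, the beta value, and the toy inputs are illustrative assumptions; the abstract does not specify how ERPO constructs its preference pairs or how the length-controlled iterative variant in stage three modifies this objective.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over summed per-response token log-probabilities.

    Each argument is a tensor of shape (batch,) giving the log-probability of
    the preferred (chosen) or dispreferred (rejected) response under the
    trainable policy or the frozen reference model.
    """
    # Implicit rewards are the beta-scaled log-ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random per-response log-probabilities.
batch = torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4)
print(dpo_loss(*batch).item())
```

In practice the log-probabilities would come from scoring the Ex-Ante reasoning responses (safe, preferred traces vs. unsafe or unhelpful ones) under the SFT-initialized policy and a frozen copy of it serving as the reference model.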
Similar Papers
SaRO: Enhancing LLM Safety through Reasoning-based Alignment
Computation and Language
Makes AI safer and more helpful.
Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling
Computation and Language
Makes AI safer and cheaper to train.
MPO: Multilingual Safety Alignment via Reward Gap Optimization
Computation and Language
Makes AI safer for everyone, everywhere.