Decoupling Safety into Orthogonal Subspace: Cost-Efficient and Performance-Preserving Alignment for Large Language Models
By: Yutao Mou, Xiaoling Zhou, Yuxiao Luo, and more
Potential Business Impact:
Makes AI safe without losing smarts.
Safety alignment is essential for building trustworthy artificial intelligence, yet enhancing model safety without degrading general performance remains challenging. Current approaches require computationally expensive searches for the optimal proportion of safety-critical to general-purpose training data in order to balance safety and general performance, incurring high costs for limited gains. In this work, we show that LoRA-based refusal training enables performance-preserving safety alignment even when trained solely on safety data, demonstrating that LoRA serves as a cost-efficient, performance-preserving, and plug-and-play safety patch. Beyond these empirical findings, we provide both theoretical and experimental evidence that LoRA decouples safety into a low-rank subspace largely orthogonal to the model's intrinsic transformation space, ensuring that safety enhancements do not interfere with the model's inherent capabilities.
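To make the orthogonal-subspace claim concrete, the sketch below shows one way to quantify how much of a LoRA safety update falls inside the base weight's dominant transformation subspace. This is an illustrative assumption, not the authors' actual analysis: the `subspace_overlap` helper, the placeholder weight shapes, and the rank-8 factors are all hypothetical.

```python
# Illustrative sketch (not the paper's exact metric): measure how much of a
# LoRA update Delta W = B @ A lies inside the top-k left singular subspace
# of the base weight W. A small overlap suggests the safety update occupies
# a subspace largely orthogonal to the model's intrinsic transformation space.
import torch

def subspace_overlap(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor, k: int = 64) -> float:
    """Fraction of the LoRA update's energy inside W's top-k left singular subspace."""
    delta_W = B @ A                      # LoRA update, shape (out_dim, in_dim)
    U, _, _ = torch.linalg.svd(W, full_matrices=False)
    U_k = U[:, :k]                       # top-k left singular vectors of the base weight
    projected = U_k @ (U_k.T @ delta_W)  # projection of Delta W onto that subspace
    return (projected.norm() ** 2 / delta_W.norm() ** 2).item()

# Hypothetical usage on a single projection matrix (random placeholders):
W = torch.randn(4096, 4096)             # base weight
A = torch.randn(8, 4096) * 0.01         # LoRA rank-8 factors
B = torch.randn(4096, 8) * 0.01
print(f"overlap with intrinsic subspace: {subspace_overlap(W, A, B):.3f}")
```

A value near zero would indicate that the safety update acts along directions the base weights barely use, which is the mechanism the abstract credits for preserving general capabilities.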
Similar Papers
Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable
Cryptography and Security
Makes smart computers safer, but less clever.
AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization
Machine Learning (CS)
Keeps AI safe while learning new tasks.
SaRO: Enhancing LLM Safety through Reasoning-based Alignment
Computation and Language
Makes AI safer and more helpful.