AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin
By: Shuo Yang, Qihui Zhang, Yuyang Liu, and more
Potential Business Impact:
Keeps AI safe from bad training data.
Large language models (LLMs) are vulnerable to safety risks during fine-tuning, where small amounts of malicious or even harmless data can compromise safeguards. In this paper, building on the concept of the alignment direction -- defined by the weight difference between aligned and unaligned models -- we observe that perturbations along this direction preserve model safety. In contrast, perturbations along directions orthogonal to the alignment direction are strongly linked to harmful perturbations: they rapidly degrade safety and frame the parameter space as a narrow safety basin. Based on this insight, we propose a safety fine-tuning method called AsFT (Anchoring Safety in Fine-Tuning), which integrates a regularization term into the training objective. This term uses the alignment direction as an anchor to suppress updates along harmful directions, constraining fine-tuning to the narrow safety basin. Extensive experiments on multiple datasets show that AsFT outperforms Safe LoRA, reducing harmful behavior by 7.60 percent, improving model performance by 3.44 percent, and maintaining robust performance across various experimental settings. Code is available at https://github.com/PKU-YuanGroup/AsFT
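The abstract describes the method only at a high level: a regularization term that uses the alignment direction (the weight difference between an aligned and an unaligned model) as an anchor and suppresses the components of a weight update that are orthogonal to it. Below is a minimal sketch of that idea, assuming per-layer weight matrices and an illustrative penalty weight `lambda_reg`; the names and the exact loss form are assumptions for illustration, not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch of the anchoring idea described in the abstract (not the authors' code).
# `w_aligned`, `w_unaligned`, `w_base`, and `lambda_reg` are illustrative assumptions.
import torch


def orthogonal_penalty(delta_w: torch.Tensor, alignment_dir: torch.Tensor) -> torch.Tensor:
    """Penalize the part of a weight update that leaves the alignment direction.

    delta_w:       flattened update to a weight matrix (fine-tuned minus base weights)
    alignment_dir: flattened weight difference between aligned and unaligned models
    """
    d = alignment_dir / (alignment_dir.norm() + 1e-12)   # unit alignment direction
    parallel = (delta_w @ d) * d                          # component along the alignment direction
    orthogonal = delta_w - parallel                       # component in orthogonal ("harmful") directions
    return orthogonal.pow(2).sum()                        # squared norm to be suppressed


# Hypothetical training objective: task loss plus the anchoring regularizer,
# summed over the layers being fine-tuned.
#
# loss = task_loss + lambda_reg * sum(
#     orthogonal_penalty((w - w_b).flatten(), (w_a - w_u).flatten())
#     for w, w_b, w_a, w_u in zip(tuned_params, base_params, aligned_params, unaligned_params)
# )
```

The point of this sketch is the design choice implied by the abstract: only the orthogonal component of the update is penalized, so movement along the alignment direction itself remains unconstrained.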
Similar Papers
Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets
Cryptography and Security
Makes AI safer by changing its training data.
Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models
Machine Learning (Stat)
Makes AI smarter without making it unsafe.
SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models?
Computers and Society
Makes AI that talks to you safer.