Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models
By: Pin-Yu Chen, Han Shen, Payel Das, and more
Potential Business Impact:
Shows the limits of making AI smarter without making it unsafe.
Fine-tuning large language models (LLMs) on task-specific datasets is a primary way of adapting them to downstream applications. However, it has been empirically observed that gaining capability this way tends to compromise safety, a phenomenon known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and the alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning and are validated by numerical experiments.
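The abstract does not spell out the two safety-aware fine-tuning strategies it analyzes, but one common strategy in this line of work is to regularize the task objective with an alignment (safety) loss. Below is a minimal sketch of that idea on toy data, assuming a linear model; the names (lambda_safety, fine_tune), the toy datasets, and the objective are illustrative assumptions, not the paper's construction.

# A minimal sketch (not the paper's method) of one common safety-aware
# fine-tuning strategy: minimize task_loss + lambda_safety * safety_loss.
# All names and data here are illustrative assumptions.
import torch

torch.manual_seed(0)

# Toy "capability" data (the fine-tuning task) and "safety" data
# (an alignment anchor the model should not drift away from).
X_task, y_task = torch.randn(64, 8), torch.randn(64, 1)
X_safe, y_safe = torch.randn(64, 8), torch.zeros(64, 1)

def fine_tune(lambda_safety: float, steps: int = 200):
    """Fit a linear model to the regularized objective and report both losses."""
    w = torch.zeros(8, 1, requires_grad=True)
    opt = torch.optim.SGD([w], lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        task_loss = torch.nn.functional.mse_loss(X_task @ w, y_task)
        safety_loss = torch.nn.functional.mse_loss(X_safe @ w, y_safe)
        (task_loss + lambda_safety * safety_loss).backward()
        opt.step()
    with torch.no_grad():
        return (torch.nn.functional.mse_loss(X_task @ w, y_task).item(),
                torch.nn.functional.mse_loss(X_safe @ w, y_safe).item())

# Sweeping lambda_safety traces out a trade-off curve: larger weights
# preserve safety (low safety loss) at the cost of higher task loss.
for lam in (0.0, 0.5, 2.0, 8.0):
    task, safe = fine_tune(lam)
    print(f"lambda={lam:>4}: task_loss={task:.3f}  safety_loss={safe:.3f}")

In this toy version, increasing lambda_safety drives the safety loss toward zero while the task loss grows, which is the qualitative trade-off behavior the paper's theory characterizes.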
Similar Papers
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective
Machine Learning (CS)
Keeps AI safe while it learns new things.
SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models?
Computers and Society
Keeps telecom chat AI safe after fine-tuning.
Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning
Computation and Language
Keeps smart computer programs safe while they learn new things.