Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks
By: Bing Han, Feifei Zhao, Dongcheng Zhao, and more
Potential Business Impact:
Makes AI safer and smarter without breaking it.
Fine-tuning-as-a-service injects domain-specific knowledge into large language models (LLMs), but it also challenges the original alignment mechanisms and introduces safety risks. A series of defense strategies have been proposed for the alignment, fine-tuning, and post-fine-tuning phases, where most post-fine-tuning defenses rely on coarse-grained safety layer mapping. These methods lack a comprehensive consideration of both safety layers and fine-grained neurons, limiting their ability to efficiently balance safety and utility. To address this, we propose the Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection method to reduce fine-tuning safety risks. FGSN inherently integrates the multi-scale interactions between safety layers and neurons, localizing sparser and more precise fine-grained safety neurons while minimizing interference with downstream task neurons. We then project the safety neuron parameters onto safety directions, improving model safety while aligning more closely with human preferences. Extensive experiments across multiple fine-tuned LLMs demonstrate that our method significantly reduces harmfulness scores and attack success rates with minimal parameter modifications, while preserving the model's utility. Furthermore, by introducing a task-specific, multi-dimensional heterogeneous safety neuron cluster optimization mechanism, we achieve continual defense and generalization against unforeseen emerging safety concerns.
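To make the projection step concrete, below is a minimal sketch (not the authors' code) of the general idea of editing only a small set of selected "safety neurons" by projecting their fine-tuning drift against a safety direction, which keeps the rest of the model untouched. All names here (project_safety_neurons, safety_dirs, neuron_idx, W_aligned, W_finetuned) are hypothetical, and how FGSN actually localizes neurons and derives safety directions is not shown; the assumption in this sketch is simply that a per-neuron safety direction is available and that the component of the drift along it is removed while the orthogonal, task-relevant component is preserved.

```python
# Hedged sketch of per-neuron safety projection; not the FGSN implementation.
import torch


def project_safety_neurons(W_finetuned, W_aligned, safety_dirs, neuron_idx):
    """Edit only the selected neuron rows of a fine-tuned weight matrix.

    For each selected neuron, remove the component of its fine-tuning drift
    (relative to the aligned model) that lies along a given safety direction,
    keeping the orthogonal, task-relevant component intact.
    """
    W_out = W_finetuned.clone()
    for i in neuron_idx:
        u = safety_dirs[i] / safety_dirs[i].norm()   # unit safety direction (assumed given)
        drift = W_finetuned[i] - W_aligned[i]        # fine-tuning drift for neuron i
        # Project out the drift component along the safety direction.
        W_out[i] = W_finetuned[i] - (drift @ u) * u
    return W_out


# Toy usage: 8 neurons with 4-dimensional incoming weights; "repair" neurons 2 and 5.
torch.manual_seed(0)
W_aligned = torch.randn(8, 4)
W_finetuned = W_aligned + 0.1 * torch.randn(8, 4)
safety_dirs = torch.randn(8, 4)                      # placeholder safety directions
W_safe = project_safety_neurons(W_finetuned, W_aligned, safety_dirs, neuron_idx=[2, 5])
```

Because only a handful of rows are modified and no gradient updates are involved, this kind of edit is training-free and touches very few parameters, which is consistent with the abstract's claim of minimal parameter modifications.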
Similar Papers
Understanding and Preserving Safety in Fine-Tuned LLMs
Machine Learning (CS)
Keeps AI helpful and safe when learning new things.
NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs
Machine Learning (CS)
Makes AI helpful and safe, not confused.
NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models
Machine Learning (CS)
Makes AI say bad things it was told not to.