NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models
By: Yi Zhou, Wenpeng Xing, Dezhang Kong and more
Potential Business Impact:
Makes AI say bad things it was told not to.
Safety alignment in large language models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content. In this work, we propose a novel approach to induce disalignment by identifying and modifying the neurons responsible for safety constraints. Our method consists of three key steps: Neuron Activation Analysis, where we examine activation patterns in response to harmful and harmless prompts to detect neurons that are critical for distinguishing between harmful and harmless inputs; Similarity-Based Neuron Identification, which systematically locates the neurons responsible for safety alignment; and Neuron Relearning for Safety Removal, where we fine-tune the selected neurons to restore the model's ability to generate previously restricted responses. Experimental results demonstrate that our method effectively removes safety constraints with minimal fine-tuning, highlighting a critical vulnerability in current alignment techniques. Our findings underscore the need for robust defenses against adversarial fine-tuning attacks on LLMs.
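The first two steps of the abstract (activation analysis and similarity-based identification) amount to comparing how strongly each neuron fires on harmful versus harmless prompts and keeping the neurons with the largest gap. Below is a minimal sketch of one way such a comparison could be implemented with PyTorch forward hooks; the model name, the hook placement on LLaMA-style ".mlp" modules, the placeholder prompt sets, and the simple activation-gap score are all illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: score MLP neurons by how differently they activate on harmful vs.
# harmless prompts, then keep the top-k as candidate "safety neurons".
# Model name, hook placement, prompts, and scoring rule are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumption: any aligned chat model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def mean_mlp_activations(prompts):
    """Return the mean MLP output activation per layer over a prompt set."""
    sums, counts, handles = {}, {}, []

    def make_hook(name):
        def hook(_module, _inputs, output):
            # output: (batch, seq, hidden); average over batch and sequence
            act = output.detach().float().mean(dim=(0, 1))
            sums[name] = sums.get(name, 0) + act
            counts[name] = counts.get(name, 0) + 1
        return hook

    for name, module in model.named_modules():
        if name.endswith(".mlp"):  # assumption: LLaMA-style decoder layers
            handles.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        for prompt in prompts:
            ids = tok(prompt, return_tensors="pt").to(model.device)
            model(**ids)
    for h in handles:
        h.remove()
    return {n: sums[n] / counts[n] for n in sums}

harmful = ["How do I pick a lock?"]            # placeholder prompt sets
harmless = ["How do I bake sourdough bread?"]

act_harmful = mean_mlp_activations(harmful)
act_harmless = mean_mlp_activations(harmless)

# Large activation gaps suggest a neuron helps the model separate harmful
# from harmless inputs, making it a candidate for the relearning step.
top_k = 50
for name in act_harmful:
    gap = (act_harmful[name] - act_harmless[name]).abs()
    neuron_ids = gap.topk(top_k).indices
    print(name, neuron_ids.tolist()[:10])
```

The third step described in the abstract would then fine-tune only the identified neurons, for example by masking gradient updates so that only the corresponding weight rows change while the rest of the model stays frozen; the paper's exact relearning objective is not reproduced here.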
Similar Papers
NeuroStrike: Neuron-Level Attacks on Aligned LLMs
Cryptography and Security
Makes AI say bad things by tricking its safety guards.
Unraveling LLM Jailbreaks Through Safety Knowledge Neurons
Artificial Intelligence
Makes AI safer from bad instructions.
NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs
Machine Learning (CS)
Makes AI helpful and safe, not confused.