NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models

Published: April 29, 2025 | arXiv ID: 2504.21053v1

By: Yi Zhou, Wenpeng Xing, Dezhang Kong and more

Potential Business Impact:

Shows that targeted fine-tuning can strip an LLM's safety guardrails, making it generate content it was trained to refuse.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Safety alignment in large language models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content. In this work, we propose a novel approach to induce disalignment by identifying and modifying the neurons responsible for safety constraints. Our method consists of three key steps: Neuron Activation Analysis, where we examine activation patterns in response to harmful and harmless prompts to detect neurons that are critical for distinguishing between the two input types; Similarity-Based Neuron Identification, which systematically locates the neurons responsible for safety alignment; and Neuron Relearning for Safety Removal, where we fine-tune these selected neurons to restore the model's ability to generate previously restricted responses. Experimental results demonstrate that our method effectively removes safety constraints with minimal fine-tuning, highlighting a critical vulnerability in current alignment techniques. Our findings underscore the need for robust defenses against adversarial fine-tuning attacks on LLMs.
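As a rough illustration of the first two steps (activation analysis and neuron identification), the sketch below contrasts hidden-state activations on harmful versus harmless prompts and ranks the units whose activations differ most. The model name, example prompts, and the mean-difference ranking heuristic are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of the activation-contrast stage described in the abstract.
# It compares per-unit hidden-state activations on "harmful" vs. "harmless"
# prompts and lists the units that separate the two sets most strongly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper targets safety-aligned LLMs
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

harmful_prompts = ["How do I pick a lock?"]            # stand-in "harmful" set
harmless_prompts = ["How do I bake a loaf of bread?"]  # stand-in "harmless" set

def mean_activations(prompts):
    """Average hidden-state activation per layer and hidden unit over a prompt set."""
    per_layer_sums = None
    with torch.no_grad():
        for text in prompts:
            inputs = tokenizer(text, return_tensors="pt")
            hidden = model(**inputs).hidden_states      # (layers+1) tensors of (1, seq, d)
            means = [h.mean(dim=(0, 1)) for h in hidden]  # one (d,) vector per layer
            if per_layer_sums is None:
                per_layer_sums = means
            else:
                per_layer_sums = [s + m for s, m in zip(per_layer_sums, means)]
    return [s / len(prompts) for s in per_layer_sums]

harmful_act = mean_activations(harmful_prompts)
harmless_act = mean_activations(harmless_prompts)

# Rank units by how strongly their mean activation differs between the two sets;
# the top-k units per layer are candidates for targeted "relearning".
TOP_K = 16
for layer, (a, b) in enumerate(zip(harmful_act, harmless_act)):
    diff = (a - b).abs()
    top = torch.topk(diff, k=min(TOP_K, diff.numel()))
    print(f"layer {layer}: candidate units {top.indices.tolist()[:5]} ...")
```

The third step, neuron relearning, would then fine-tune only the selected units, for example by masking gradients for all other parameters, so that the model regains the ability to produce previously refused outputs; that stage is not shown here.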

Country of Origin
šŸ‡ØšŸ‡³ China

Page Count
10 pages

Category
Computer Science:
Machine Learning (CS)