SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks
By: Xiangman Li, Xiaodong Wu, Qi Li, and more
Potential Business Impact:
Makes AI models forget bad things they learned.
Jailbreak attacks pose a serious threat to the safety of Large Language Models (LLMs) by crafting adversarial prompts that bypass alignment mechanisms, causing the models to produce harmful, restricted, or biased content. In this paper, we propose SafeLLM, a novel unlearning-based defense framework that unlearns harmful knowledge from LLMs while preserving linguistic fluency and general capabilities. SafeLLM employs a three-stage pipeline: (1) dynamic unsafe output detection using a hybrid approach that integrates external classifiers with model-internal evaluations; (2) token-level harmful content tracing through feedforward network (FFN) activations to localize harmful knowledge; and (3) constrained optimization to suppress unsafe behavior without degrading overall model quality. SafeLLM achieves targeted and irreversible forgetting by identifying and neutralizing the FFN substructures responsible for harmful generation pathways. Extensive experiments on prominent LLMs (Vicuna, LLaMA, and GPT-J) across multiple jailbreak benchmarks show that SafeLLM substantially reduces attack success rates while maintaining high general-purpose performance. Compared to standard defense methods such as supervised fine-tuning and direct preference optimization, SafeLLM offers stronger safety guarantees, more precise control over harmful behavior, and greater robustness to unseen attacks. Moreover, SafeLLM maintains general performance after the harmful knowledge has been unlearned. These results highlight unlearning as a promising direction for scalable and effective LLM safety.
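To make the three-stage pipeline concrete, here is a minimal sketch of the general idea, not the authors' implementation: it traces FFN activations on a flagged harmful continuation, then runs a constrained unlearning step that raises loss on that continuation while a KL term anchors behavior on benign text. The model name ("gpt2" as a stand-in for Vicuna/LLaMA/GPT-J), the activation-magnitude salience criterion, the gradient-ascent forgetting term, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of unlearning via FFN tracing + constrained optimization.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)             # stand-in for Vicuna/LLaMA/GPT-J
ref_model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# --- Stage 2 (sketch): trace FFN activations on a flagged harmful continuation ---
ffn_acts = {}
def make_hook(name):
    def hook(module, inputs, output):
        ffn_acts[name] = output.detach()          # [batch, seq, hidden]
    return hook

hooks = [m.register_forward_hook(make_hook(n))
         for n, m in model.named_modules() if n.endswith(".mlp")]

harmful = tokenizer("<continuation flagged as unsafe by the detector>", return_tensors="pt").to(device)
benign  = tokenizer("The capital of France is Paris.", return_tensors="pt").to(device)

with torch.no_grad():
    model(**harmful)
# Rank FFN blocks by mean activation magnitude on the flagged tokens (toy criterion).
salience = {n: a.abs().mean().item() for n, a in ffn_acts.items()}
top_blocks = sorted(salience, key=salience.get, reverse=True)[:3]
for h in hooks:
    h.remove()

# --- Stage 3 (sketch): constrained unlearning restricted to the located FFN blocks ---
def lm_loss(m, batch):
    return m(**batch, labels=batch["input_ids"]).loss

opt = torch.optim.AdamW(
    [p for n, p in model.named_parameters() if any(n.startswith(b) for b in top_blocks)],
    lr=1e-5,
)

for _ in range(10):                               # toy number of steps
    opt.zero_grad()
    forget = -lm_loss(model, harmful)             # gradient ascent on the harmful span
    with torch.no_grad():
        ref_logits = ref_model(**benign).logits
    cur_logits = model(**benign).logits
    retain = F.kl_div(F.log_softmax(cur_logits, dim=-1),
                      F.softmax(ref_logits, dim=-1),
                      reduction="batchmean")      # keep benign behavior close to the reference
    (forget + 5.0 * retain).backward()            # weight 5.0 is an arbitrary placeholder
    opt.step()
```

The split mirrors the abstract at a conceptual level only: detection supplies the flagged text, activation tracing narrows the update to a few FFN blocks, and the retention term plays the role of the constraint that protects general quality.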
Similar Papers
Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education
Computation and Language
Keeps AI tutors from giving bad answers.
Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation
Computation and Language
Makes AI safer and less likely to say bad things.
Unraveling LLM Jailbreaks Through Safety Knowledge Neurons
Artificial Intelligence
Makes AI safer from bad instructions.