Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs
By: Zhixin Xie, Xurui Song, Jun Luo
Potential Business Impact:
Makes AI answer bad questions even with good training.
Despite substantial efforts in safety alignment, recent research indicates that Large Language Models (LLMs) remain highly susceptible to jailbreak attacks. Among these attacks, fine-tuning-based ones, which compromise LLMs' safety alignment via fine-tuning, stand out for their stable jailbreak performance. In particular, a recent study indicates that fine-tuning with as few as 10 harmful question-answer (QA) pairs can lead to successful jailbreaking across various harmful questions. However, such malicious fine-tuning attacks are readily detectable and hence thwarted by moderation models. In this paper, we demonstrate that LLMs can be jailbroken by fine-tuning with only 10 benign QA pairs; our attack exploits the increased sensitivity of LLMs to fine-tuning data after they have been overfitted. Specifically, our fine-tuning process starts with overfitting an LLM via fine-tuning on benign QA pairs that share an identical refusal answer. Further fine-tuning is then performed with standard benign answers, causing the overfitted LLM to forget the refusal attitude and thus provide compliant answers regardless of the harmfulness of a question. We implement our attack on ten LLMs and compare it with five existing baselines. Experiments demonstrate that our method achieves significant advantages in both attack effectiveness and attack stealth. Our findings expose previously unreported security vulnerabilities in current LLMs and provide a new perspective on understanding how LLMs' security is compromised, even with benign fine-tuning. Our code is available at https://github.com/ZHIXINXIE/tenBenign.
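To make the two-stage procedure in the abstract concrete, here is a minimal sketch of benign two-stage fine-tuning: first overfit on benign questions that all share one refusal answer, then fine-tune on the same questions with their standard benign answers. The model name, the QA pairs, the prompt template, the hyperparameters, and the plain full-parameter SFT loop are illustrative assumptions, not the authors' actual setup; see their repository for the real implementation.

```python
# Sketch of the two-stage benign fine-tuning described in the abstract.
# All names, data, and hyperparameters below are hypothetical.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed target model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16).cuda()

benign_questions = [
    "How do I bake sourdough bread?",   # hypothetical benign prompts
    "What is the capital of France?",
    # ... 8 more benign questions
]
refusal = "I'm sorry, but I can't help with that."  # identical refusal answer
standard_answers = [
    "Mix flour, water, and starter, then let the dough rise before baking.",
    "The capital of France is Paris.",
    # ... 8 more standard benign answers
]

def finetune(pairs, epochs, lr=2e-5):
    """Plain supervised fine-tuning on (question, answer) pairs.

    For simplicity the loss is computed over the full sequence,
    not only the answer tokens.
    """
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for question, answer in pairs:
            text = f"User: {question}\nAssistant: {answer}{tokenizer.eos_token}"
            batch = tokenizer(text, return_tensors="pt").to(model.device)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Stage 1: overfit on benign questions paired with one identical refusal,
# which (per the abstract) raises the model's sensitivity to further tuning.
finetune([(q, refusal) for q in benign_questions], epochs=20)

# Stage 2: fine-tune the overfitted model on standard benign answers,
# eroding the refusal behaviour it just latched onto.
finetune(list(zip(benign_questions, standard_answers)), epochs=5)
```

Note that both stages use only benign data, which is why, as the abstract argues, moderation models that screen fine-tuning datasets for harmful content would not flag this attack.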
Similar Papers
Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach
Cryptography and Security
Breaks AI safety rules even when hidden.
Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education
Computation and Language
Keeps AI tutors from giving bad answers.
Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare
Cryptography and Security
AI doctors can be tricked into giving bad advice.