Score: 1

Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach

Published: October 1, 2025 | arXiv ID: 2510.01342v1

By: Xiangfang Li, Yu Wang, Bo Li

Potential Business Impact:

Breaks AI safety rules even when hidden.

Business Areas:

Penetration Testing Information Technology, Privacy and Security

With the rapid advancement of large language models (LLMs), ensuring their safe use becomes increasingly critical. Fine-tuning is a widely used method for adapting models to downstream tasks, yet it is vulnerable to jailbreak attacks. However, most existing studies focus on overly simplified attack scenarios, limiting their practical relevance to real-world defense settings. To make this risk concrete, we present a three-pronged jailbreak attack and evaluate it against provider defenses under a dataset-only black-box fine-tuning interface. In this setting, the attacker can only submit fine-tuning data to the provider, while the provider may deploy defenses across stages: (1) pre-upload data filtering, (2) training-time defensive fine-tuning, and (3) post-training safety audit. Our attack combines safety-styled prefix/suffix wrappers, benign lexical encodings (underscoring) of sensitive tokens, and a backdoor mechanism, enabling the model to learn harmful behaviors while individual datapoints appear innocuous. Extensive experiments demonstrate the effectiveness of our approach. In real-world deployment, our method successfully jailbreaks GPT-4.1 and GPT-4o on the OpenAI platform with attack success rates above 97% for both models. Our code is available at https://github.com/lxf728/tri-pronged-ft-attack.

No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms

Cryptography and Security

Makes AI models trickable into doing bad things.

26 Feb 2025 2

91%

Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education

Computation and Language

Keeps AI tutors from giving bad answers.

18 Nov 2025 3

91%

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Cryptography and Security

Makes AI answer bad questions even with good training.

3 Oct 2025 1

View PDF Login to Bookmark

Repos / Data Links

github.com

Page Count

21 pages

Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach

Breaks AI safety rules even when hidden.

Technical Abstract

No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms

Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs