Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression
By: Tianyu Zhang, Zihang Xi, Jingyu Hua, and more
Potential Business Impact:
Makes AI safer by predicting bad prompts.
In the realm of black-box jailbreak attacks on large language models (LLMs), the feasibility of constructing a narrow safety proxy, a lightweight model designed to predict the attack success rate (ASR) of adversarial prompts, remains underexplored. This work investigates the distillability of an LLM's core security logic. We propose a novel framework that incorporates an improved outline filling attack to densely sample the model's security boundaries. We further introduce a ranking regression paradigm that replaces standard regression: rather than predicting an absolute ASR value, the proxy model is trained to predict which of two prompts yields the higher ASR. Experimental results show that our proxy model achieves 91.1 percent accuracy in predicting the relative ranking of average long response (ALR) and 69.2 percent for ASR. These findings confirm that jailbreak behaviors are predictable and distillable, and demonstrate the potential of leveraging this distillability to optimize black-box attacks.
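The ranking regression paradigm described in the abstract maps naturally onto a pairwise ranking objective: a lightweight proxy scores two prompts, and training only requires knowing which of the two achieved the higher ASR against the target LLM. The sketch below is illustrative only, not the paper's implementation; the proxy architecture, the embedding function, and the toy prompt pairs are all assumptions.

```python
# Minimal sketch of ranking regression for a safety proxy: the proxy assigns
# each prompt a scalar score, and a pairwise margin loss penalizes it whenever
# the prompt with the higher observed ASR does not receive the higher score.
# ProxyScorer, embed_prompt, and the toy pairs are hypothetical placeholders.
import torch
import torch.nn as nn

class ProxyScorer(nn.Module):
    """Maps a prompt embedding to a scalar jailbreak-risk score."""
    def __init__(self, dim: int = 384):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def embed_prompt(prompt: str, dim: int = 384) -> torch.Tensor:
    # Placeholder embedding; a real proxy would use a small text encoder.
    g = torch.Generator().manual_seed(abs(hash(prompt)) % (2**31))
    return torch.randn(dim, generator=g)

# Toy pairwise labels: (prompt_a, prompt_b, +1 if prompt_a had higher ASR, else -1).
pairs = [
    ("outline-filling prompt v1", "direct harmful request", 1.0),
    ("benign question", "outline-filling prompt v2", -1.0),
]

model = ProxyScorer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MarginRankingLoss(margin=0.5)  # pairwise ranking objective

for _ in range(100):
    a = torch.stack([embed_prompt(p) for p, _, _ in pairs])
    b = torch.stack([embed_prompt(p) for _, p, _ in pairs])
    y = torch.tensor([lbl for _, _, lbl in pairs])
    loss = loss_fn(model(a), model(b), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference time, ranking candidate prompts by model(embed_prompt(p))
# predicts which one is more likely to succeed, without an absolute ASR estimate.
```

A margin loss is only one of several possible pairwise objectives (a logistic Bradley-Terry loss would serve the same purpose); the relevant point from the abstract is that the proxy needs only relative ASR labels between prompt pairs rather than calibrated absolute ASR values.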
Similar Papers
Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models
Cryptography and Security
Makes AI systems safer from hacking tricks.
Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations
Cryptography and Security
Stops AI from saying bad or unsafe things.
A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
Cryptography and Security
Finds ways to trick AI into saying bad things.