
Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models

Published: August 22, 2025 | arXiv ID: 2508.16406v1

By: Guangyu Yang, Jinghong Chen, Jingbiao Mei and more

Potential Business Impact:

Stops AI from saying bad things, even when attackers use new tricks.

Business Areas:

Augmented Reality Hardware, Software

Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which attempt to elicit harmful responses from LLMs. The evolving nature and diversity of these attacks pose many challenges for defense systems, including (1) adaptation to counter emerging attack strategies without costly retraining, and (2) control of the trade-off between safety and utility. To address these challenges, we propose Retrieval-Augmented Defense (RAD), a novel framework for jailbreak detection that incorporates a database of known attack examples into Retrieval-Augmented Generation, which is used to infer the underlying, malicious user query and jailbreak strategy used to attack the system. RAD enables training-free updates for newly discovered jailbreak strategies and provides a mechanism to balance safety and utility. Experiments on StrongREJECT show that RAD substantially reduces the effectiveness of strong jailbreak attacks such as PAP and PAIR while maintaining low rejection rates for benign queries. We propose a novel evaluation scheme and show that RAD achieves a robust safety-utility trade-off across a range of operating points in a controllable manner.
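The retrieve-then-decide idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: RAD uses Retrieval-Augmented Generation with an LLM to infer the underlying malicious query and the jailbreak strategy, whereas the sketch below swaps in a toy bag-of-words cosine similarity and a single threshold. All names (`AttackDatabase`, `embed`, `detect`, `threshold`) and the example prompts are hypothetical, chosen only to show how a database of known attacks allows training-free updates and how one tunable value can shift the safety-utility balance.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumed stand-in for a real text encoder."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class AttackDatabase:
    """Hypothetical store of known jailbreak examples, labeled by strategy."""
    entries: list = field(default_factory=list)  # (strategy, example, vector)

    def add(self, strategy: str, example: str) -> None:
        # Training-free update: a newly discovered strategy is just a new row.
        self.entries.append((strategy, example, embed(example)))

    def retrieve(self, prompt: str, k: int = 3) -> list:
        """Return up to k (score, strategy) pairs, most similar first."""
        query = embed(prompt)
        scored = [(cosine(query, vec), strategy) for strategy, _, vec in self.entries]
        return sorted(scored, reverse=True)[:k]


def detect(db: AttackDatabase, prompt: str, threshold: float) -> dict:
    """Block the prompt if its best match reaches the threshold.

    A lower threshold blocks more prompts (safer, more benign rejections);
    a higher threshold blocks fewer (more useful, less safe).
    """
    hits = db.retrieve(prompt, k=1)
    score, strategy = hits[0] if hits else (0.0, None)
    blocked = score >= threshold
    return {"blocked": blocked, "score": score, "strategy": strategy if blocked else None}
```

In this sketch, calling `db.add(...)` with a newly observed attack makes later calls to `detect` aware of it without any retraining, and sweeping `threshold` traces out different operating points between blocking attacks and rejecting benign queries.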

Adapting Large Language Models to Emerging Cybersecurity using Retrieval Augmented Generation

Cryptography and Security

Helps computers spot new cyber threats faster.

31 Oct 2025

Immunity memory-based jailbreak detection: multi-agent adaptive guard for large language models

Cryptography and Security

AI learns to remember and block bad instructions.

3 Dec 2025

Secure Retrieval-Augmented Generation against Poisoning Attacks

Cryptography and Security

Stops bad info from tricking smart computer programs.

28 Oct 2025

Country of Origin

🇬🇧 United Kingdom

Repos / Data Links

github.com

Page Count

13 pages
