Immunity memory-based jailbreak detection: multi-agent adaptive guard for large language models
By: Jun Leng, Litian Zhang, Xi Zhang
Potential Business Impact:
AI learns to remember and block bad instructions.
Large language models (LLMs) have become foundational in AI systems, yet they remain vulnerable to adversarial jailbreak attacks. These attacks involve carefully crafted prompts that bypass safety guardrails and induce models to produce harmful content. Detecting such malicious input queries is therefore critical for maintaining LLM safety. Existing methods for jailbreak detection typically fine-tune LLMs on fixed training datasets to serve as static safety guards. However, these methods incur substantial computational costs whenever model parameters must be updated to improve robustness, especially in the face of novel jailbreak attacks. Inspired by immunological memory mechanisms, we propose the Multi-Agent Adaptive Guard (MAAG) framework for jailbreak detection. The core idea is to equip the guard with memory capabilities: upon encountering novel jailbreak attacks, the system memorizes their attack patterns, enabling it to rapidly and accurately identify similar threats in future encounters. Specifically, MAAG first extracts activation values from input prompts and compares them with historical activations stored in a memory bank for quick preliminary detection. A defense agent then simulates responses based on these detection results, and an auxiliary agent supervises the simulation process to provide secondary filtering of the detection outcomes. Extensive experiments across five open-source models demonstrate that MAAG significantly outperforms state-of-the-art (SOTA) methods, achieving 98% detection accuracy and a 96% F1-score across a diverse range of attack scenarios.
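To make the memory-bank lookup concrete, below is a minimal sketch of the preliminary detection step the abstract describes. This is not the authors' implementation: the activation-extraction step is stubbed out, and the class name, cosine-similarity matching, and 0.85 threshold are illustrative assumptions.

```python
import numpy as np

class ActivationMemoryBank:
    """Sketch of an immune-memory-style store: activation vectors of
    known jailbreak prompts are memorized, and a new prompt is flagged
    when its activation is close to a stored attack pattern."""

    def __init__(self, threshold: float = 0.85):
        self.patterns: list[np.ndarray] = []  # memorized attack activations (unit norm)
        self.threshold = threshold            # cosine-similarity cutoff (assumed value)

    def memorize(self, activation: np.ndarray) -> None:
        """Store the activation of a confirmed jailbreak prompt."""
        self.patterns.append(activation / np.linalg.norm(activation))

    def is_similar_to_known_attack(self, activation: np.ndarray) -> bool:
        """Preliminary detection: compare the query activation against
        every memorized pattern and flag if any match exceeds the cutoff."""
        if not self.patterns:
            return False
        query = activation / np.linalg.norm(activation)
        sims = np.stack(self.patterns) @ query  # cosine similarities, shape (n,)
        return bool(sims.max() >= self.threshold)

# Hypothetical usage: in MAAG the vectors would come from an LLM's hidden
# states for a given prompt; random vectors stand in for them here.
bank = ActivationMemoryBank()
bank.memorize(np.random.rand(4096))                      # memorize a known attack
print(bank.is_similar_to_known_attack(np.random.rand(4096)))
```

In the framework described above, a prompt flagged (or cleared) by this lookup would then be passed to the defense agent, whose simulated response the auxiliary agent supervises as a second filter.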
Similar Papers
Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations
Cryptography and Security
Stops AI from saying bad or unsafe things.
Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models
Cryptography and Security
Stops AI from saying bad things, even against new tricks.
Activation-Guided Local Editing for Jailbreaking Attacks
Cryptography and Security
Finds AI flaws to build stronger defenses.