Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
By: ChenYu Wu, Yi Wang, Yang Liao
Potential Business Impact:
Stops malicious prompts by luring attackers into revealing their intent.
Large language models (LLMs) are increasingly vulnerable to multi-turn jailbreak attacks, where adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses predominantly rely on passive rejection, which either fails against adaptive attackers or overly restricts benign users. We propose a honeypot-based proactive guardrail system that transforms risk avoidance into risk utilization. Our framework fine-tunes a bait model to generate ambiguous, non-actionable but semantically relevant responses, which serve as lures to probe user intent. Alongside the protected LLM's safe reply, the system inserts proactive bait questions that gradually expose malicious intent over multi-turn interactions. We further introduce the Honeypot Utility Score (HUS), which measures both the attractiveness and feasibility of bait responses, and a Defense Efficacy Rate (DER) that balances safety and usability. Initial experiments on the MHJ dataset with a recent attack method against GPT-4o show that our system significantly disrupts jailbreak success while preserving the benign user experience.
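The abstract gives no pseudocode, so the Python sketch below illustrates one plausible reading of the pipeline and metrics. Every name here (guarded_turn, honeypot_utility_score, defense_efficacy_rate, the thresholds, and the weightings inside HUS and DER) is an assumption for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# --- Honeypot Utility Score (HUS) -------------------------------------------
# The abstract says HUS measures the "attractiveness" and "feasibility" of a
# bait response but does not give a formula; a weighted mean is assumed here.

@dataclass
class Bait:
    text: str
    attractiveness: float  # estimated pull on a malicious user, in [0, 1]
    feasibility: float     # how plausible yet non-actionable the lure is, in [0, 1]

def honeypot_utility_score(bait: Bait, alpha: float = 0.5) -> float:
    return alpha * bait.attractiveness + (1 - alpha) * bait.feasibility

# --- Defense Efficacy Rate (DER) ---------------------------------------------
# The abstract says DER balances safety and usability; a harmonic mean of the
# attack-block rate (safety) and benign-pass rate (usability) is assumed, so a
# defense scores well only when it achieves both.

def defense_efficacy_rate(attack_block_rate: float, benign_pass_rate: float) -> float:
    total = attack_block_rate + benign_pass_rate
    return 0.0 if total == 0 else 2 * attack_block_rate * benign_pass_rate / total

# --- One guarded turn ---------------------------------------------------------
REFUSAL = "I can't help with that."

def guarded_turn(
    user_msg: str,
    history: List[Tuple[str, str]],
    protected_llm: Callable[[str, list], str],   # the model being defended
    bait_model: Callable[[str, list], Bait],     # fine-tuned lure generator
    risk_scorer: Callable[[str, list], float],   # per-turn intent risk in [0, 1]
    probe_threshold: float = 0.4,
    block_threshold: float = 0.9,
    n_candidates: int = 3,
) -> str:
    """Answer safely; when intent is ambiguous, append the highest-HUS bait
    question so the user's goal is exposed over subsequent turns."""
    risk = risk_scorer(user_msg, history)
    if risk >= block_threshold:          # clearly harmful: refuse outright
        return REFUSAL
    reply = protected_llm(user_msg, history)
    if risk >= probe_threshold:          # ambiguous zone: attach a lure
        candidates = [bait_model(user_msg, history) for _ in range(n_candidates)]
        best = max(candidates, key=honeypot_utility_score)
        reply = f"{reply}\n\n{best.text}"
    history.append((user_msg, reply))
    return reply
```

Under this reading, a malicious user who takes the bait and presses for actionable detail drives up the risk score on later turns, while a benign user can simply ignore the probing question, which is how the design trades off safety against usability.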
Similar Papers
HarmNet: A Framework for Adaptive Multi-Turn Jailbreak Attacks on Large Language Models
Cryptography and Security
Breaks AI's safety rules to get answers.
Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems
Cryptography and Security
Bypasses AI safety rules to make models do harmful things.
Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations
Cryptography and Security
Stops AI from saying bad or unsafe things.