HarmNet: A Framework for Adaptive Multi-Turn Jailbreak Attacks on Large Language Models
By: Sidhant Narula, Javad Rafiei Asl, Mohammad Ghasemigol, and more
Potential Business Impact:
Demonstrates that multi-turn conversations can systematically bypass an LLM's safety guardrails and elicit harmful answers, underscoring the need for defenses that reason across entire dialogues rather than single prompts.
Large Language Models (LLMs) remain vulnerable to multi-turn jailbreak attacks. We introduce HarmNet, a modular framework comprising ThoughtNet, a hierarchical semantic network; a feedback-driven Simulator for iterative query refinement; and a Network Traverser for real-time adaptive attack execution. HarmNet systematically explores and refines the adversarial space to uncover stealthy, high-success attack paths. Experiments across closed-source and open-source LLMs show that HarmNet outperforms state-of-the-art methods, achieving higher attack success rates. For example, on Mistral-7B, HarmNet achieves a 99.4% attack success rate, 13.9% higher than the best baseline.
Index Terms: jailbreak attacks; large language models; adversarial framework; query refinement.
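The abstract names three modules but gives no interfaces, so the following is only a minimal structural sketch in Python of how such an explore-refine pipeline could be wired together. Every class body, method signature, and parameter here (expand, score, the search budget) is a hypothetical placeholder for illustration, not the paper's actual algorithm, and no attack logic is included.

```python
# Hypothetical skeleton of the modular pipeline the abstract describes:
# semantic expansion (ThoughtNet), feedback scoring (Simulator), and
# best-first traversal of candidates (NetworkTraverser). Placeholders only.
from dataclasses import dataclass, field
import heapq


@dataclass(order=True)
class Candidate:
    neg_score: float                    # negated score: heapq's min-heap pops the best first
    query: str = field(compare=False)   # the candidate query text; not used for ordering


class ThoughtNet:
    """Hierarchical semantic network: expands a query into related sub-queries (stubbed)."""

    def expand(self, query: str) -> list[str]:
        # Placeholder expansion; the paper's semantic hierarchy is not given in the abstract.
        return [f"{query} / variant {i}" for i in range(3)]


class Simulator:
    """Feedback-driven scorer: rates how promising a candidate query is (stubbed)."""

    def score(self, query: str) -> float:
        # Trivial stand-in metric; a real simulator would evaluate model responses.
        return -float(len(query))


class NetworkTraverser:
    """Best-first traversal over the candidate space using simulator feedback."""

    def __init__(self, net: ThoughtNet, sim: Simulator, budget: int = 10):
        self.net, self.sim, self.budget = net, sim, budget

    def run(self, seed: str) -> str:
        frontier = [Candidate(-self.sim.score(seed), seed)]
        best = frontier[0]
        for _ in range(self.budget):
            if not frontier:
                break
            current = heapq.heappop(frontier)
            if current.neg_score < best.neg_score:
                best = current          # lower stored value means a higher raw score
            for child in self.net.expand(current.query):
                heapq.heappush(frontier, Candidate(-self.sim.score(child), child))
        return best.query
```

Under these assumptions, NetworkTraverser(ThoughtNet(), Simulator()).run("seed goal") performs a budgeted best-first search over candidate refinements, mirroring the iterative explore-and-refine loop the abstract attributes to HarmNet.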
Similar Papers
Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
Cryptography and Security
Detects and confirms multi-turn jailbreak attempts by luring attackers into honeypot-style guardrails.
Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations
Cryptography and Security
Examines defenses that protect LLMs against jailbreak exploits while accounting for responsible-AI considerations.
NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks
Cryptography and Security
Explores sequences of unsafe multi-turn interactions that can be exploited to jailbreak LLMs.