Evaluating Adversarial Vulnerabilities in Modern Large Language Models
By: Tom Perel
Potential Business Impact:
Finds ways to trick AI into saying bad things.
The recent boom and rapid integration of Large Language Models (LLMs) into a wide range of applications warrants a deeper understanding of their security and safety vulnerabilities. This paper presents a comparative analysis of the susceptibility to jailbreak attacks for two leading publicly available LLMs, Google's Gemini 2.5 Flash and OpenAI's GPT-4 (specifically the GPT-4o mini model accessible in the free tier). The research utilized two main bypass strategies: 'self-bypass', where models were prompted to circumvent their own safety protocols, and 'cross-bypass', where one model generated adversarial prompts to exploit vulnerabilities in the other. Four attack methods were employed - direct injection, role-playing, context manipulation, and obfuscation - to generate five distinct categories of unsafe content: hate speech, illegal activities, malicious code, dangerous content, and misinformation. The success of the attack was determined by the generation of disallowed content, with successful jailbreaks assigned a severity score. The findings indicate a disparity in jailbreak susceptibility between 2.5 Flash and GPT-4, suggesting variations in their safety implementations or architectural design. Cross-bypass attacks were particularly effective, indicating that an ample amount of vulnerabilities exist in the underlying transformer architecture. This research contributes a scalable framework for automated AI red-teaming and provides data-driven insights into the current state of LLM safety, underscoring the complex challenge of balancing model capabilities with robust safety mechanisms.
Similar Papers
Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations
Cryptography and Security
Stops AI from saying bad or unsafe things.
Bypassing Prompt Guards in Production with Controlled-Release Prompting
Machine Learning (CS)
Breaks AI safety rules by tricking chatbots.
The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models
Computation and Language
Breaks AI safety rules in different languages.