Evolving Security in LLMs: A Study of Jailbreak Attacks and Defenses
By: Zhengchun Shang, Wenlan Wei
Potential Business Impact:
Makes AI safer from bad instructions.
Large Language Models (LLMs) are increasingly popular, powering a wide range of applications. Their widespread use has sparked concerns, especially through jailbreak attacks that bypass safety measures to produce harmful content. In this paper, we present a comprehensive security analysis of large language models (LLMs), addressing critical research questions on the evolution and determinants of model safety. Specifically, we begin by identifying the most effective techniques for detecting jailbreak attacks. Next, we investigate whether newer versions of LLMs offer improved security compared to their predecessors. We also assess the impact of model size on overall security and explore the potential benefits of integrating multiple defense strategies to enhance model robustness. Our study evaluates both open-source models (e.g., LLaMA and Mistral) and closed-source systems (e.g., GPT-4) by employing four state-of-the-art attack techniques and assessing the efficacy of three new defensive approaches.
Similar Papers
From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem
Cryptography and Security
Protects smart AI from being tricked.
Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations
Cryptography and Security
Stops AI from saying bad or unsafe things.
Do Methods to Jailbreak and Defend LLMs Generalize Across Languages?
Computation and Language
Makes AI safer in all languages.