SoK: Evaluating Jailbreak Guardrails for Large Language Models
By: Xunguang Wang, Zhenlan Ji, Wenxuan Wang and more
Potential Business Impact:
Protects AI from harmful instructions.
Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety mechanisms. Guardrails, external defense mechanisms that monitor and control LLM interactions, have emerged as a promising solution. However, the current landscape of LLM guardrails is fragmented, lacking both a unified taxonomy and a comprehensive evaluation framework. In this Systematization of Knowledge (SoK) paper, we present the first holistic analysis of jailbreak guardrails for LLMs. We propose a novel, multi-dimensional taxonomy that categorizes guardrails along six key dimensions, and introduce a Security-Efficiency-Utility evaluation framework to assess their practical effectiveness. Through extensive analysis and experiments, we identify the strengths and limitations of existing guardrail approaches, explore their universality across attack types, and provide insights into optimizing defense combinations. Our work offers a structured foundation for future research and development, aiming to guide the principled advancement and deployment of robust LLM guardrails. The code is available at https://github.com/xunguangwang/SoK4JailbreakGuardrails.
Similar Papers
SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models
Cryptography and Security
Makes AI safer from bad instructions.
PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks
Cryptography and Security
Makes AI safer from bad instructions.