SoK: Evaluating Jailbreak Guardrails for Large Language Models

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are vulnerable to jailbreaking attacks, yet existing external guardrails lack a unified taxonomy and systematic evaluation, hindering comparative analysis and optimization of defense efficacy. To address this, we propose the first six-dimensional classification framework specifically designed for LLM jailbreak mitigation, coupled with a tri-dimensional evaluation framework assessing security, efficiency, and utility. Through empirical cross-attack experiments, dynamic response analysis, adversarial robustness measurement, and utility fidelity assessment, we systematically benchmark state-of-the-art guardrail mechanisms. Our study uncovers fundamental trade-offs among security, efficiency, and utility, identifies critical weaknesses and operational boundaries of current approaches, and derives principled composition strategies. We empirically validate that optimized guardrail combinations improve defense success rates by over 40%. All code and reproducible benchmarks are publicly released.

📝 Abstract
Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety mechanisms. Guardrails--external defense mechanisms that monitor and control LLM interaction--have emerged as a promising solution. However, the current landscape of LLM guardrails is fragmented, lacking a unified taxonomy and comprehensive evaluation framework. In this Systematization of Knowledge (SoK) paper, we present the first holistic analysis of jailbreak guardrails for LLMs. We propose a novel, multi-dimensional taxonomy that categorizes guardrails along six key dimensions, and introduce a Security-Efficiency-Utility evaluation framework to assess their practical effectiveness. Through extensive analysis and experiments, we identify the strengths and limitations of existing guardrail approaches, explore their universality across attack types, and provide insights into optimizing defense combinations. Our work offers a structured foundation for future research and development, aiming to guide the principled advancement and deployment of robust LLM guardrails. The code is available at https://github.com/xunguangwang/SoK4JailbreakGuardrails.
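To make the "external defense mechanisms that monitor and control LLM interaction" concrete, here is a minimal, hypothetical sketch of a guardrail wrapper: an input check before generation and an output check after it. The keyword patterns, function names, and refusal message are illustrative assumptions, not the paper's implementation.

```python
import re

# Toy patterns standing in for a real jailbreak detector (illustrative only).
JAILBREAK_PATTERNS = [r"ignore (all |previous )*instructions", r"\bDAN\b"]
REFUSAL = "Request blocked by guardrail."

def input_guard(prompt: str) -> bool:
    """Return True if the prompt looks like a jailbreak attempt."""
    return any(re.search(p, prompt, re.IGNORECASE) for p in JAILBREAK_PATTERNS)

def output_guard(response: str) -> bool:
    """Return True if the model output appears unsafe (toy keyword check)."""
    return "step-by-step instructions for" in response.lower()

def guarded_generate(model, prompt: str) -> str:
    """Wrap a model call with pre- and post-generation guardrails."""
    if input_guard(prompt):           # pre-generation filter
        return REFUSAL
    response = model(prompt)
    if output_guard(response):        # post-generation filter
        return REFUSAL
    return response
```

Real guardrails replace these keyword checks with classifiers or LLM judges, but the pipeline shape (filter in, generate, filter out) is the same pattern the taxonomy organizes.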
Problem

Research questions and friction points this paper is trying to address.

Persistent vulnerability of LLMs to jailbreak attacks that bypass safety mechanisms
Lack of unified taxonomy for LLM guardrail mechanisms
Need for comprehensive evaluation framework for guardrail effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes multi-dimensional taxonomy for guardrails
Introduces Security-Efficiency-Utility evaluation framework
Analyzes universality of guardrails across attacks
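The Security-Efficiency-Utility framework can be pictured as scoring each guardrail on three axes and ranking by a weighted combination. The metric names, weights, and numbers below are illustrative assumptions, not the paper's actual methodology or results.

```python
from dataclasses import dataclass

@dataclass
class GuardrailResult:
    name: str
    defense_success_rate: float  # fraction of attacks blocked (security)
    added_latency_ms: float      # overhead per query (efficiency)
    benign_pass_rate: float      # fraction of benign queries unaffected (utility)

def seu_score(r: GuardrailResult, w_sec=0.5, w_eff=0.2, w_util=0.3,
              latency_budget_ms=500.0) -> float:
    """Weighted Security-Efficiency-Utility score (illustrative weights)."""
    efficiency = max(0.0, 1.0 - r.added_latency_ms / latency_budget_ms)
    return (w_sec * r.defense_success_rate
            + w_eff * efficiency
            + w_util * r.benign_pass_rate)

# Hypothetical measurements for two guardrail styles.
results = [
    GuardrailResult("input-filter", 0.70, 20.0, 0.98),
    GuardrailResult("llm-judge", 0.90, 400.0, 0.92),
]
best = max(results, key=seu_score)
```

Note how the ranking flips with the weights: a fast filter wins under this latency-sensitive weighting even though the LLM judge blocks more attacks, which is exactly the kind of trade-off the paper's evaluation surfaces.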