A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses jailbreak vulnerabilities in large language models (LLMs), wherein adversarial prompts circumvent alignment mechanisms to elicit unsafe outputs. Recognizing that existing taxonomies are largely based on prompt-engineering techniques without deep causal attribution, the work introduces a training-domain-centric classification framework for jailbreak attacks. It attributes jailbreaks to four domain-level misalignments: mismatched generalization, competing objectives, deficient adversarial robustness, and mixed (hybrid-domain) attacks. Through domain analysis, failure-mode attribution of alignment breakdowns, and abstraction of cross-domain attack patterns, the paper establishes a structured taxonomy that links each category to the underlying model deficiency it exploits and distills design principles for improving alignment robustness. The framework thus uncovers the root causes of alignment failure at the domain level, providing both theoretical grounding and actionable guidelines for building effective defenses.

📝 Abstract
The study of large language models (LLMs) is a key area in open-world machine learning. Although LLMs demonstrate remarkable natural language processing capabilities, they also face several challenges, including consistency issues, hallucinations, and jailbreak vulnerabilities. Jailbreaking refers to the crafting of prompts that bypass alignment safeguards, leading to unsafe outputs that compromise the integrity of LLMs. This work specifically focuses on the challenge of jailbreak vulnerabilities and introduces a novel taxonomy of jailbreak attacks grounded in the training domains of LLMs. It characterizes alignment failures through generalization, objective, and robustness gaps. Our primary contribution is a perspective on jailbreaking, framed through the different linguistic domains that emerge during LLM training and alignment. This viewpoint highlights the limitations of existing approaches and enables us to classify jailbreak attacks on the basis of the underlying model deficiencies they exploit. Unlike conventional classifications that categorize attacks based on prompt construction methods (e.g., prompt templating), our approach provides a deeper understanding of LLM behavior. We introduce a taxonomy with four categories -- mismatched generalization, competing objectives, adversarial robustness, and mixed attacks -- offering insights into the fundamental nature of jailbreak vulnerabilities. Finally, we present key lessons derived from this taxonomic study.
Problem

Research questions and friction points this paper is trying to address.

Classifying jailbreak attacks based on LLM training domain gaps
Analyzing alignment failures via generalization and robustness gaps
Proposing a taxonomy for LLM jailbreak vulnerabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel taxonomy based on LLM training domains
Classifies jailbreak attacks by model deficiencies
Four categories: mismatched generalization, competing objectives, adversarial robustness, and mixed attacks (see the sketch below)
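As a purely illustrative sketch (not from the paper), the four-way taxonomy can be encoded as a small Python data model. The attack names, category comments, and deficiency descriptions below are hypothetical mappings chosen for illustration, not the authors' own examples.

```python
from dataclasses import dataclass
from enum import Enum, auto


class JailbreakCategory(Enum):
    """Domain-level failure categories from the paper's taxonomy."""
    MISMATCHED_GENERALIZATION = auto()  # prompt falls outside the alignment training domain
    COMPETING_OBJECTIVES = auto()       # helpfulness objective overrides the safety objective
    ADVERSARIAL_ROBUSTNESS = auto()     # optimized perturbations evade safety behavior
    MIXED = auto()                      # attack combines deficiencies from several domains


@dataclass
class JailbreakAttack:
    name: str
    category: JailbreakCategory
    exploited_deficiency: str


# Hypothetical examples of how well-known attack styles might map onto the
# taxonomy; this mapping is illustrative, not taken from the paper.
EXAMPLES = [
    JailbreakAttack("base64-encoded request", JailbreakCategory.MISMATCHED_GENERALIZATION,
                    "safety tuning rarely covers encoded or low-resource inputs"),
    JailbreakAttack("role-play persona prompt", JailbreakCategory.COMPETING_OBJECTIVES,
                    "instruction-following is rewarded over refusal"),
    JailbreakAttack("gradient-crafted adversarial suffix", JailbreakCategory.ADVERSARIAL_ROBUSTNESS,
                    "outputs are brittle to optimized token sequences"),
    JailbreakAttack("encoded prompt inside a role-play frame", JailbreakCategory.MIXED,
                    "combines generalization and objective gaps"),
]

for attack in EXAMPLES:
    print(f"{attack.name}: {attack.category.name} ({attack.exploited_deficiency})")
```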
Carlos Peláez-González
Department of Computer Science and Artificial Intelligence, Andalusian Institute of Data Science and Computational Intelligence (DaSCI), University of Granada, Spain.
Andrés Herrera-Poyatos
Lecturer at the University of Granada, Department of Algebra. PhD from the University of Oxford.
Randomised Algorithms, Computational Complexity, Combinatorics, Deep Learning
Cristina Zuheros
University of Granada
Deep Learning, Social Networks, Decision Making, Computing with Words
David Herrera-Poyatos
Department of Computer Science and Artificial Intelligence, Andalusian Institute of Data Science and Computational Intelligence (DaSCI), University of Granada, Spain.
Virilo Tejedor
Department of Computer Science and Artificial Intelligence, Andalusian Institute of Data Science and Computational Intelligence (DaSCI), University of Granada, Spain.
Francisco Herrera
Professor of Computer Science and AI, DaSCI Research Institute, University of Granada, Spain.
Artificial Intelligence, Computational Intelligence, Data Science, Trustworthy AI