A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses jailbreak vulnerabilities in large language models (LLMs), wherein adversarial prompts circumvent alignment mechanisms to elicit unsafe outputs. Recognizing that existing taxonomies are largely based on prompt-engineering techniques without deep causal attribution, the work introduces a training-domain-centric classification framework for jailbreak attacks. It attributes jailbreaks to four domain-level misalignments: mismatched generalization, competing objectives, deficient adversarial robustness, and mixed (hybrid-domain) attacks. Through domain analysis, failure-mode attribution of alignment breakdowns, and abstraction of cross-domain attack patterns, the paper establishes a structured taxonomy that links each category to the underlying model deficiency it exploits and distills design principles for improving alignment robustness. The framework thus uncovers the root causes of alignment failure at the domain level, providing both theoretical grounding and actionable guidelines for building effective defenses.

📝 Abstract
The study of large language models (LLMs) is a key area in open-world machine learning. Although LLMs demonstrate remarkable natural language processing capabilities, they also face several challenges, including consistency issues, hallucinations, and jailbreak vulnerabilities. Jailbreaking refers to the crafting of prompts that bypass alignment safeguards, leading to unsafe outputs that compromise the integrity of LLMs. This work specifically focuses on the challenge of jailbreak vulnerabilities and introduces a novel taxonomy of jailbreak attacks grounded in the training domains of LLMs. It characterizes alignment failures through generalization, objective, and robustness gaps. Our primary contribution is a perspective on jailbreaking, framed through the different linguistic domains that emerge during LLM training and alignment. This viewpoint highlights the limitations of existing approaches and enables us to classify jailbreak attacks on the basis of the underlying model deficiencies they exploit. Unlike conventional classifications that categorize attacks based on prompt construction methods (e.g., prompt templating), our approach provides a deeper understanding of LLM behavior. We introduce a taxonomy with four categories -- mismatched generalization, competing objectives, adversarial robustness, and mixed attacks -- offering insights into the fundamental nature of jailbreak vulnerabilities. Finally, we present key lessons derived from this taxonomic study.
Problem

Research questions and friction points this paper is trying to address.

Classifying jailbreak attacks based on LLM training domain gaps
Analyzing alignment failures via generalization and robustness gaps
Proposing a taxonomy for LLM jailbreak vulnerabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel taxonomy based on LLM training domains
Classifies jailbreak attacks by model deficiencies
Four categories: mismatched generalization, competing objectives, adversarial robustness, and mixed attacks (see the sketch below)
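As a purely illustrative sketch (not from the paper), the four-way taxonomy can be encoded as a small Python data model. The attack names, category comments, and deficiency descriptions below are hypothetical mappings chosen for illustration, not the authors' own examples.

```python
from dataclasses import dataclass
from enum import Enum, auto


class JailbreakCategory(Enum):
    """Domain-level failure categories from the paper's taxonomy."""
    MISMATCHED_GENERALIZATION = auto()  # prompt falls outside the alignment training domain
    COMPETING_OBJECTIVES = auto()       # helpfulness objective overrides the safety objective
    ADVERSARIAL_ROBUSTNESS = auto()     # optimized perturbations evade safety behavior
    MIXED = auto()                      # attack combines deficiencies from several domains


@dataclass
class JailbreakAttack:
    name: str
    category: JailbreakCategory
    exploited_deficiency: str


# Hypothetical examples of how well-known attack styles might map onto the
# taxonomy; this mapping is illustrative, not taken from the paper.
EXAMPLES = [
    JailbreakAttack("base64-encoded request", JailbreakCategory.MISMATCHED_GENERALIZATION,
                    "safety tuning rarely covers encoded or low-resource inputs"),
    JailbreakAttack("role-play persona prompt", JailbreakCategory.COMPETING_OBJECTIVES,
                    "instruction-following is rewarded over refusal"),
    JailbreakAttack("gradient-crafted adversarial suffix", JailbreakCategory.ADVERSARIAL_ROBUSTNESS,
                    "outputs are brittle to optimized token sequences"),
    JailbreakAttack("encoded prompt inside a role-play frame", JailbreakCategory.MIXED,
                    "combines generalization and objective gaps"),
]

for attack in EXAMPLES:
    print(f"{attack.name}: {attack.category.name} ({attack.exploited_deficiency})")
```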
Carlos Peláez-González
Department of Computer Science and Artificial Intelligence, Andalusian Institute of Data Science and Computational Intelligence (DaSCI), University of Granada, Spain.
Andrés Herrera-Poyatos
Lecturer at the University of Granada, Department of Algebra. PhD from the University of Oxford.
Randomised Algorithms, Computational Complexity, Combinatorics, Deep Learning
Cristina Zuheros
University of Granada
Deep Learning, Social Networks, Decision Making, Computing with Words
David Herrera-Poyatos
Department of Computer Science and Artificial Intelligence, Andalusian Institute of Data Science and Computational Intelligence (DaSCI), University of Granada, Spain.
Virilo Tejedor
Department of Computer Science and Artificial Intelligence, Andalusian Institute of Data Science and Computational Intelligence (DaSCI), University of Granada, Spain.
Francisco Herrera
Professor of Computer Science and AI, DaSCI Research Institute, University of Granada, Spain.
Artificial Intelligence, Computational Intelligence, Data Science, Trustworthy AI