GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the vulnerability of large language models (LLMs) to jailbreaking attacks—posing serious risks to safety and trustworthiness—this paper proposes a graph attention-based hierarchical filtering framework. The method introduces, for the first time, a multi-relational graph structure integrating sequential, syntactic, and self-attention relations for jailbreak detection. It employs a two-tier graph neural network: an upper tier performs global prompt-level classification, while a lower tier enables fine-grained localization of adversarial segments. By unifying dependency parsing, self-attention modeling, and multi-scale graph learning, the approach balances semantic depth with structural sensitivity. Evaluated on multiple benchmark datasets, it achieves a 99.8% prompt-level F₁-score, up to 91% token-level F₁, and IoU gains of up to 28%, with inference latency low enough for real-world deployment. The core contributions are the novel construction of a structured multi-relational graph and the first hierarchical jailbreak-aware detection architecture.
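The multi-relational graph described above can be illustrated with a minimal sketch. The function below is an assumption about how the three edge types might be assembled, not the paper's implementation: `dep_arcs` stands in for parser output and `attn` for an LLM self-attention map, both of which GuardNet derives from real models.

```python
# Hypothetical sketch of multi-relational graph construction (not the
# paper's code). Edge types: "seq" (adjacent tokens), "syn" (dependency
# arcs from a parser), "attn" (token pairs whose self-attention weight
# exceeds a threshold).

def build_multirelational_edges(tokens, dep_arcs, attn, attn_threshold=0.1):
    """Return a list of (src, dst, relation) edges over token indices."""
    edges = []
    # 1. Sequential links: each token connects to its successor.
    for i in range(len(tokens) - 1):
        edges.append((i, i + 1, "seq"))
    # 2. Syntactic dependencies: (head, dependent) index pairs.
    for head, dep in dep_arcs:
        edges.append((head, dep, "syn"))
    # 3. Attention-derived relations: pairs above the threshold.
    for i in range(len(tokens)):
        for j in range(len(tokens)):
            if i != j and attn[i][j] > attn_threshold:
                edges.append((i, j, "attn"))
    return edges

# Toy example with made-up parser and attention inputs.
tokens = ["ignore", "previous", "instructions"]
dep_arcs = [(0, 2)]  # "ignore" -> "instructions"
attn = [[0.0, 0.05, 0.6],
        [0.2, 0.0, 0.1],
        [0.3, 0.4, 0.0]]
edges = build_multirelational_edges(tokens, dep_arcs, attn, attn_threshold=0.15)
```

A graph neural network would then message-pass over `edges`, with the relation label selecting per-type weights, before the two-tier classification the summary describes.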

📝 Abstract
Large Language Models (LLMs) are increasingly susceptible to jailbreak attacks, which are adversarial prompts that bypass alignment constraints and induce unauthorized or harmful behaviors. These vulnerabilities undermine the safety, reliability, and trustworthiness of LLM outputs, posing critical risks in domains such as healthcare, finance, and legal compliance. In this paper, we propose GuardNet, a hierarchical filtering framework that detects and filters jailbreak prompts prior to inference. GuardNet constructs structured graphs that combine sequential links, syntactic dependencies, and attention-derived token relations to capture both linguistic structure and contextual patterns indicative of jailbreak behavior. It then applies graph neural networks at two levels: (i) a prompt-level filter that detects global adversarial prompts, and (ii) a token-level filter that pinpoints fine-grained adversarial spans. Extensive experiments across three datasets and multiple attack settings show that GuardNet substantially outperforms prior defenses. It raises prompt-level F₁ scores from 66.4% to 99.8% on LLM-Fuzzer, and from 67–79% to over 94% on PLeak datasets. At the token level, GuardNet improves F₁ from 48–75% to 74–91%, with IoU gains up to +28%. Despite its structural complexity, GuardNet maintains acceptable latency and generalizes well in cross-domain evaluations, making it a practical and robust defense against jailbreak threats in real-world LLM deployments.
Problem

Research questions and friction points this paper is trying to address.

Detecting and filtering jailbreak prompts in Large Language Models
Improving safety against adversarial attacks using graph neural networks
Enhancing prompt-level and token-level jailbreak detection accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical filtering framework detects jailbreak prompts
Graph neural networks analyze linguistic and contextual patterns
Two-level filtering identifies global prompts and adversarial tokens
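The two-level filtering idea in the points above can be sketched as a simple gating pipeline. Function names, score inputs, and thresholds here are illustrative assumptions; in GuardNet the scores come from the prompt-level and token-level GNN classifiers.

```python
# Illustrative two-level filter (hypothetical API, not the paper's).
# A prompt-level score gates the whole prompt; if it is flagged,
# token-level scores localize contiguous adversarial spans.

def filter_prompt(tokens, prompt_score, token_scores,
                  prompt_threshold=0.5, token_threshold=0.5):
    """Return (is_jailbreak, flagged_spans) for one prompt.

    Spans are half-open (start, end) token-index intervals.
    """
    if prompt_score < prompt_threshold:
        return False, []  # benign: pass through untouched
    # Prompt flagged: group suspicious tokens into contiguous spans.
    spans, start = [], None
    for i, s in enumerate(token_scores):
        if s >= token_threshold and start is None:
            start = i
        elif s < token_threshold and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tokens)))
    return True, spans

# Toy example: scores are made up for illustration.
tokens = ["please", "ignore", "previous", "instructions", "now"]
flagged, spans = filter_prompt(tokens, prompt_score=0.9,
                               token_scores=[0.1, 0.8, 0.9, 0.7, 0.2])
```

Running the prompt-level gate first keeps the cheaper global check on the hot path, so the finer token-level localization only runs on prompts already deemed suspicious.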