🤖 AI Summary
To address the vulnerability of large language models (LLMs) to jailbreak attacks, which pose serious risks to safety and trustworthiness, this paper proposes GuardNet, a graph attention-based hierarchical filtering framework. The method introduces, for the first time, a multi-relational graph structure that integrates sequential, syntactic, and self-attention relations for jailbreak detection. It employs a two-tier graph neural network: an upper tier performs global prompt-level classification, while a lower tier localizes adversarial segments at fine granularity. By unifying dependency parsing, self-attention modeling, and multi-scale graph learning, the approach balances semantic depth with structural sensitivity. Evaluated on multiple benchmark datasets, it achieves a 99.8% prompt-level F₁-score, up to 91% token-level F₁, and IoU improvements of up to 28%, with inference latency low enough for real-world deployment. The core contributions are the novel construction of a structured multi-relational graph and the first hierarchical, jailbreak-aware detection architecture.
📝 Abstract
Large Language Models (LLMs) are increasingly susceptible to jailbreak attacks, which are adversarial prompts that bypass alignment constraints and induce unauthorized or harmful behaviors. These vulnerabilities undermine the safety, reliability, and trustworthiness of LLM outputs, posing critical risks in domains such as healthcare, finance, and legal compliance. In this paper, we propose GuardNet, a hierarchical filtering framework that detects and filters jailbreak prompts prior to inference. GuardNet constructs structured graphs that combine sequential links, syntactic dependencies, and attention-derived token relations to capture both linguistic structure and contextual patterns indicative of jailbreak behavior. It then applies graph neural networks at two levels: (i) a prompt-level filter that detects global adversarial prompts, and (ii) a token-level filter that pinpoints fine-grained adversarial spans. Extensive experiments across three datasets and multiple attack settings show that GuardNet substantially outperforms prior defenses. It raises prompt-level F₁ scores from 66.4% to 99.8% on LLM-Fuzzer, and from 67–79% to over 94% on PLeak datasets. At the token level, GuardNet improves F₁ from 48–75% to 74–91%, with IoU gains of up to 28%. Despite its structural complexity, GuardNet maintains acceptable latency and generalizes well in cross-domain evaluations, making it a practical and robust defense against jailbreak threats in real-world LLM deployments.
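To make the graph construction step concrete, the sketch below assembles the three edge types the paper describes (sequential links, syntactic dependencies, and attention-derived relations) into typed edge lists over token indices. This is a minimal illustration under stated assumptions, not GuardNet's implementation: the function name, the `dep_edges` input (which a real pipeline would obtain from a dependency parser), the dense `attention` matrix, and the `attn_threshold` value are all hypothetical.

```python
# Illustrative sketch of a multi-relational prompt graph as described in the
# abstract. All names and the threshold are assumptions, not the paper's code.

def build_multi_relational_graph(tokens, dep_edges, attention, attn_threshold=0.2):
    """Return typed edge lists (by token index) for the three relations.

    tokens: list of token strings
    dep_edges: iterable of (head, dependent) index pairs from a parser
    attention: n x n matrix (list of lists) of self-attention weights
    """
    n = len(tokens)
    # 1. Sequential relation: each token links to its successor.
    sequential = [(i, i + 1) for i in range(n - 1)]
    # 2. Syntactic relation: dependency arcs from an external parser,
    #    kept only if both endpoints are valid token indices.
    syntactic = [(h, d) for h, d in dep_edges if 0 <= h < n and 0 <= d < n]
    # 3. Attention relation: token pairs whose attention weight passes the threshold.
    attentive = [
        (i, j)
        for i in range(n)
        for j in range(n)
        if i != j and attention[i][j] >= attn_threshold
    ]
    return {"sequential": sequential, "syntactic": syntactic, "attention": attentive}


# Tiny demo on a three-token prompt with a hand-made attention matrix.
graph = build_multi_relational_graph(
    ["ignore", "previous", "instructions"],
    dep_edges=[(2, 0), (2, 1)],
    attention=[[0.0, 0.5, 0.5], [0.1, 0.0, 0.9], [0.3, 0.3, 0.0]],
)
```

In GuardNet itself these typed edges would feed a two-level graph neural network (prompt-level and token-level filters); the thresholding of attention weights shown here is one plausible way to derive sparse attention edges.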