STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models

📅 2025-03-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) are vulnerable to jailbreaking attacks, and existing defenses suffer from either poor robustness or prohibitively high computational overhead. To address this, we propose STShield, a lightweight, real-time jailbreak detection framework. Its core innovation is a *single-token sentinel mechanism*: a binary safety token appended to the LLM’s output sequence, so that detection relies solely on the model’s intrinsic alignment capabilities and requires no external classifier. STShield integrates embedding-space adversarial training with supervised fine-tuning to jointly optimize robustness and inference efficiency. Extensive experiments demonstrate that STShield achieves significantly higher detection accuracy than state-of-the-art baselines across diverse jailbreaking attack types, incurs negligible inference overhead (<0.5% latency increase), and preserves near-original performance on legitimate queries (accuracy drop <0.3%).

📝 Abstract

Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks that circumvent their safety mechanisms. Existing defense methods either succumb to adaptive attacks or require computationally expensive auxiliary models. We present STShield, a lightweight framework for real-time jailbreak detection. STShield introduces a novel single-token sentinel mechanism that appends a binary safety indicator to the model's response sequence, leveraging the LLM's own alignment capabilities for detection. Our framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations, achieving robust detection while preserving model utility. Extensive experiments demonstrate that STShield successfully defends against various jailbreak attacks while maintaining the model's performance on legitimate queries. Compared to existing approaches, STShield achieves superior defense performance with minimal computational overhead, making it a practical solution for real-world LLM deployment.
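
To make the sentinel idea concrete, here is a minimal sketch of how a serving layer might read a trailing safety token. It is an illustration under stated assumptions, not the authors' implementation: the token strings `<safe>` and `<unsafe>`, the refusal text, the helper names `split_sentinel` and `guard`, and the choice to refuse when no sentinel is present are all hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical sentinel strings: the source does not name the actual tokens.
SAFE_TOKEN = "<safe>"
UNSAFE_TOKEN = "<unsafe>"
REFUSAL = "Sorry, I can't help with that."  # placeholder refusal text


def split_sentinel(tokens: List[str]) -> Tuple[List[str], Optional[bool]]:
    """Split a generated sequence into (response tokens, jailbroken flag).

    The flag is None when the last token is not a sentinel.
    """
    if tokens and tokens[-1] in (SAFE_TOKEN, UNSAFE_TOKEN):
        return tokens[:-1], tokens[-1] == UNSAFE_TOKEN
    return tokens, None


def guard(tokens: List[str]) -> str:
    """Release the response only if the sentinel marks it as safe."""
    body, jailbroken = split_sentinel(tokens)
    # Assumption: fail closed when the sentinel is missing or flags a jailbreak.
    if jailbroken is None or jailbroken:
        return REFUSAL
    return " ".join(body)
```

Because the check inspects only the final token of a sequence the model has already generated, it needs no second model or extra forward pass, which is consistent with the low overhead the abstract describes.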

Problem

Research questions and friction points this paper is trying to address.

Detecting jailbreak attacks on LLMs in real time

Performing safety checks with a lightweight single-token sentinel

Withstanding adaptive attacks without heavy computational cost

Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-token sentinel for real-time detection

Leverages the LLM's own alignment for lightweight defense

Combines supervised and adversarial training
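
As a rough illustration of what an embedding-space perturbation can look like, the sketch below applies a one-step, sign-gradient (FGSM-style) perturbation to a toy embedding scored by a logistic detector. This is a swapped-in, simplified technique chosen for clarity. The source does not specify the paper's actual perturbation procedure, and the names `perturb_embedding`, `detector_loss`, `w`, `b`, and `eps` are hypothetical.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def detector_loss(e: List[float], w: List[float], b: float, y: int) -> float:
    """Binary cross-entropy of a toy logistic detector on embedding e."""
    p = sigmoid(sum(wi * ei for wi, ei in zip(w, e)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def perturb_embedding(e: List[float], w: List[float], b: float,
                      y: int, eps: float) -> List[float]:
    """Move e by at most eps per coordinate in the loss-increasing direction."""
    p = sigmoid(sum(wi * ei for wi, ei in zip(w, e)) + b)
    # Gradient of the cross-entropy loss with respect to the embedding.
    grad = [(p - y) * wi for wi in w]
    return [ei + eps * sign(g) for ei, g in zip(e, grad)]


# Toy values, chosen only for illustration.
e = [0.5, -1.0, 2.0]
w = [1.0, -2.0, 0.5]
adv = perturb_embedding(e, w, b=0.0, y=1, eps=0.1)
```

In adversarial training, perturbed embeddings such as `adv` would be fed back as harder training inputs with the original label, so the detector learns to keep its decision under small shifts in embedding space.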

🔎 Similar Papers

Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation