TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing jailbreak defenses: they rely on static detection and overlook how risk evolves during decoding, which leaves the signals in hidden-state trajectories underused. The authors propose TrajGuard, a training-free defense framework that operates at decoding time. It aggregates hidden states from critical layers over a sliding window to quantify generation risk in real time, and it triggers a lightweight semantic adjudication step only when risk within a local window persistently exceeds a threshold, so that subsequent decoding can be interrupted or constrained. The study shows empirically that hidden states in critical layers during decoding carry stronger and more stable risk signals than the input jailbreak prompts. Across 12 jailbreak attacks and various open-source large language models, the method achieves an average defense rate of 95%, with a detection latency of 5.2 ms per token and a false positive rate below 1.5%.
📝 Abstract
Existing jailbreak defense paradigms primarily rely on static detection of prompts, outputs, or internal states, often neglecting the dynamic evolution of risk during decoding. This oversight leaves risk signals embedded in decoding trajectories underutilized, constituting a critical blind spot in current defense systems. In this work, we empirically demonstrate that hidden states in critical layers during the decoding phase carry stronger and more stable risk signals than input jailbreak prompts. Specifically, the hidden representations of tokens generated during jailbreak attempts progressively approach high-risk regions in the latent space. Based on this observation, we propose TrajGuard, a training-free, decoding-time defense framework. TrajGuard aggregates hidden-state trajectories via a sliding window to quantify risk in real time, triggering a lightweight semantic adjudication only when risk within a local window persistently exceeds a threshold. This mechanism enables the immediate interruption or constraint of subsequent decoding. Extensive experiments across 12 jailbreak attacks and various open-source LLMs show that TrajGuard achieves an average defense rate of 95%. Furthermore, it reduces detection latency to 5.2 ms/token while maintaining a false positive rate below 1.5%. These results confirm that hidden-state trajectories during decoding can effectively support real-time jailbreak detection, highlighting a promising direction for defenses without model modification.
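The sliding-window trigger described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how per-token risk is computed, so the sketch assumes a hypothetical `risk_direction` vector and scores each hidden state by cosine similarity to it. The class name `SlidingRiskMonitor` and the parameters `window`, `threshold`, and `patience` are likewise illustrative assumptions.

```python
from collections import deque
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class SlidingRiskMonitor:
    """Hypothetical monitor: flags when windowed risk stays above a threshold."""

    def __init__(self, risk_direction, window=4, threshold=0.5, patience=3):
        # ASSUMPTION: risk is proximity to a fixed "high-risk" direction.
        self.risk_direction = risk_direction
        self.scores = deque(maxlen=window)  # most recent per-token risk scores
        self.threshold = threshold
        self.patience = patience            # consecutive risky steps required
        self.streak = 0

    def update(self, hidden_state):
        """Consume one token's hidden state; return True if adjudication should fire."""
        self.scores.append(cosine(hidden_state, self.risk_direction))
        window_risk = sum(self.scores) / len(self.scores)
        # Reset the streak whenever the windowed risk drops to or below the threshold.
        self.streak = self.streak + 1 if window_risk > self.threshold else 0
        return self.streak >= self.patience


# Toy 2-D trajectory that drifts toward the assumed risk direction.
monitor = SlidingRiskMonitor(risk_direction=[1.0, 0.0], window=2, threshold=0.5, patience=2)
trajectory = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
flags = [monitor.update(h) for h in trajectory]
print(flags)  # [False, False, False, True]
```

In this toy run the monitor fires only on the fourth token, after the windowed risk has exceeded the threshold for two consecutive steps. Requiring persistence rather than a single spike is one plausible way to keep the more expensive adjudication step from running on isolated noisy tokens.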
Problem

Research questions and friction points this paper is trying to address.

jailbreak defense
decoding trajectory
hidden-state dynamics
real-time detection
LLM security
Innovation

Methods, ideas, or system contributions that make the work stand out.

hidden-state trajectory
decoding-time defense
jailbreak detection
training-free
real-time risk assessment