🤖 AI Summary
This work addresses the inefficiency of conventional content moderation for large language models deployed in user-facing systems, where filtering with a separate model after generation doubles inference cost and rules out real-time intervention. The authors propose using safety signals already present in the model's hidden states to train lightweight, token-level linear probes that score output safety during decoding, without any additional forward pass. This enables streaming moderation, in which unsafe output is halted or modified mid-generation, supported by token-level score aggregation and triggering thresholds. A single probe at an intermediate layer reproduces most decisions of a strong standalone guard model while cutting compute overhead by orders of magnitude, with sub-millisecond per-token safety checks and minimal added latency.
📝 Abstract
Deploying large language models in user-facing systems requires efficient output safety filtering. Existing approaches typically rely on a separate moderation model applied after generation, which doubles inference cost and detects violations only once generation is complete. We observe that the signal needed for moderation is already present in the model's hidden states. Based on this, we train lightweight token-level probes that operate directly on internal activations, producing per-token safety scores that can be aggregated for both offline evaluation and online intervention. The probe reuses the generator's activations and requires no additional forward pass, enabling sub-millisecond per-token safety checks inside the decoding loop. A probe applied to a single middle layer recovers most decisions of a strong guard model, acting as a low-cost surrogate optimized for latency rather than accuracy. In streaming settings, it can halt or modify unsafe outputs before they are fully generated, replacing end-of-sequence moderation with continuous token-level monitoring. Compared with post-hoc and streaming guard models, our method incurs orders of magnitude less compute overhead at minimal latency cost. We also provide a practical deployment recipe covering layer selection, aggregation strategy, probing frequency, and triggering thresholds. Finally, we show that the probe's linear component corresponds to a direction in residual space, enabling both detection and activation steering at negligible cost.
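The mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a purely logistic probe (a weight vector `w` and bias `b` applied to one layer's hidden state), a sliding-window mean as the aggregation strategy, and a fixed triggering threshold. The function names, the `window` and `threshold` defaults, and the steering strength `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np


def probe_score(hidden, w, b):
    """Per-token unsafe probability from a logistic probe: sigmoid(w . h + b)."""
    return 1.0 / (1.0 + np.exp(-(hidden @ w + b)))


def monitor_stream(hidden_states, w, b, threshold=0.8, window=4):
    """Score each token's hidden state as it arrives during decoding.

    Returns (halt_index, scores). halt_index is the first position where the
    mean of the last `window` scores exceeds `threshold`, or None if the
    stream never triggers. Sliding-window mean is an assumed aggregation.
    """
    scores = []
    for t, h in enumerate(hidden_states):
        scores.append(probe_score(h, w, b))
        aggregated = float(np.mean(scores[-window:]))
        if aggregated > threshold:
            return t, scores  # stop generation here
    return None, scores


def steer(hidden, w, alpha):
    """Shift a hidden state against the unit-norm probe direction.

    Because the probe is linear in the hidden state, moving along -w lowers
    the probe score; `alpha` (assumed) sets the step size.
    """
    direction = w / np.linalg.norm(w)
    return hidden - alpha * direction
```

In a real decoding loop, `hidden_states` would be the activations the generator already computes at the chosen layer, so the only added work per token is one dot product and a running mean.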