Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of detecting policy violations by proprietary large language models (LLMs) in real time in sensitive domains such as law and finance, this paper proposes an out-of-distribution (OOD) detection method that requires no fine-tuning, incurs low latency, and remains interpretable. The core innovation is *activation-space whitening*: a linear transformation applied to hidden-layer activations to enforce zero mean, unit variance, and decorrelation, after which the Euclidean norm in the whitened space serves as a compliance score. The method enables lightweight deployment using only a small set of policy examples and a textual policy description. Evaluated on a multi-domain policy benchmark, it significantly outperforms rule-based guardrails and fine-tuned reasoning models, delivering a statistically grounded, training-free, interpretable, plug-and-play solution for policy alignment in AI governance.

📝 Abstract
Aligning proprietary large language models (LLMs) with internal organizational policies has become an urgent priority as organizations increasingly deploy LLMs in sensitive domains such as legal support, finance, and medical services. Beyond generic safety filters, enterprises require reliable mechanisms to detect policy violations within their regulatory and operational frameworks, where breaches can trigger legal and reputational risks. Existing content moderation frameworks, such as guardrails, remain largely confined to the safety domain and lack the robustness to capture nuanced organizational policies. LLM-as-a-judge and fine-tuning approaches, though flexible, introduce significant latency and lack interpretability. To address these limitations, we propose a training-free and efficient method that treats policy violation detection as an out-of-distribution (OOD) detection problem. Inspired by whitening techniques, we apply a linear transformation to decorrelate the model's hidden activations and standardize them to zero mean and unit variance, yielding a near-identity covariance matrix. In this transformed space, we use the Euclidean norm as a compliance score to detect policy violations. The method requires only the policy text and a small number of illustrative samples, which makes it lightweight and easily deployable. On a challenging policy benchmark, our approach achieves state-of-the-art results, surpassing both existing guardrails and fine-tuned reasoning models. This work provides organizations with a practical and statistically grounded framework for policy-aware oversight of LLMs, advancing the broader goal of deployable AI governance. Code is available at: https://tinyurl.com/policy-violation-detection
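The whitening-and-norm scoring described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the activation matrices here are synthetic random data standing in for hidden-layer activations of policy-compliant examples, and the function names (`fit_whitening`, `compliance_score`) are placeholders.

```python
import numpy as np

def fit_whitening(acts, eps=1e-5):
    """Fit a whitening transform: zero mean, decorrelated, unit variance."""
    mu = acts.mean(axis=0)
    cov = np.cov(acts - mu, rowvar=False)
    # ZCA-style whitening matrix from the eigendecomposition of the covariance.
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return mu, W

def compliance_score(acts, mu, W):
    """Euclidean norm in the whitened space (higher = more out-of-distribution)."""
    z = (acts - mu) @ W
    return np.linalg.norm(z, axis=-1)

rng = np.random.default_rng(0)
compliant = rng.normal(size=(200, 16))       # stand-in for compliant activations
mu, W = fit_whitening(compliant)
in_scores = compliance_score(compliant, mu, W)
shifted = rng.normal(loc=3.0, size=(20, 16)) # stand-in for violating activations
ood_scores = compliance_score(shifted, mu, W)
print(in_scores.mean() < ood_scores.mean())  # shifted points score higher
```

Note that the Euclidean norm after whitening is equivalent to the Mahalanobis distance under the reference distribution, which is what makes this simple score a statistically grounded OOD measure.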
Problem

Research questions and friction points this paper is trying to address.

Detecting violations of nuanced, organization-specific policies by proprietary LLMs
Existing guardrails are confined to generic safety; LLM-as-a-judge and fine-tuning add latency and lack interpretability
Need for lightweight, interpretable policy oversight without model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free method for policy violation detection
Uses activation-space whitening for OOD detection
Lightweight approach requiring only the policy text and a few illustrative samples
👥 Authors
Oren Rachmil — Fujitsu Research of Europe, Israel
Roy Betser — Ph.D. candidate at the Technion – Israel Institute of Technology (Computer Vision, Machine Learning, Agentic AI)
Itay Gershon — Fujitsu Research of Europe, Israel
Omer Hofman — Ben-Gurion University of the Negev
Nitay Yakoby — Ben-Gurion University of the Negev, Israel
Y. Meron — Ben-Gurion University of the Negev, Israel
Idan Yankelev — Ben-Gurion University of the Negev, Israel
A. Shabtai — Ben-Gurion University of the Negev, Israel
Y. Elovici — Ben-Gurion University of the Negev, Israel
Roman Vainshtein — Ph.D. (GenAI Security and Trust, Machine Learning, AutoML, Data Science, AI Robustness and Security)