Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of detecting policy violations by proprietary large language models (LLMs) in real time in sensitive domains such as law and finance, this paper proposes an out-of-distribution (OOD) detection method that requires no fine-tuning, incurs low latency, and remains interpretable. The core innovation is *activation-space whitening*: a linear transformation applied to hidden-layer activations to enforce zero mean, unit variance, and decorrelation, after which the Euclidean norm in the whitened space serves as a compliance score. The method enables lightweight deployment using only a small set of policy examples and a textual policy description. Evaluated on a multi-domain policy benchmark, it significantly outperforms rule-based guardrails and fine-tuned reasoning models, delivering a statistically grounded, training-free, interpretable, plug-and-play solution for policy alignment in AI governance.

📝 Abstract
Aligning proprietary large language models (LLMs) with internal organizational policies has become an urgent priority as organizations increasingly deploy LLMs in sensitive domains such as legal support, finance, and medical services. Beyond generic safety filters, enterprises require reliable mechanisms to detect policy violations within their regulatory and operational frameworks, where breaches can trigger legal and reputational risks. Existing content moderation frameworks, such as guardrails, remain largely confined to the safety domain and lack the robustness to capture nuanced organizational policies. LLM-as-a-judge and fine-tuning approaches, though flexible, introduce significant latency and lack interpretability. To address these limitations, we propose a training-free and efficient method that treats policy violation detection as an out-of-distribution (OOD) detection problem. Inspired by whitening techniques, we apply a linear transformation to decorrelate the model's hidden activations and standardize them to zero mean and unit variance, yielding a near-identity covariance matrix. In this transformed space, we use the Euclidean norm as a compliance score to detect policy violations. The method requires only the policy text and a small number of illustrative samples, which makes it lightweight and easily deployable. On a challenging policy benchmark, our approach achieves state-of-the-art results, surpassing both existing guardrails and fine-tuned reasoning models. This work provides organizations with a practical and statistically grounded framework for policy-aware oversight of LLMs, advancing the broader goal of deployable AI governance. Code is available at: https://tinyurl.com/policy-violation-detection
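The whitening-and-norm scoring described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the activation matrices here are synthetic random data standing in for hidden-layer activations of policy-compliant examples, and the function names (`fit_whitening`, `compliance_score`) are placeholders.

```python
import numpy as np

def fit_whitening(acts, eps=1e-5):
    """Fit a whitening transform: zero mean, decorrelated, unit variance."""
    mu = acts.mean(axis=0)
    cov = np.cov(acts - mu, rowvar=False)
    # ZCA-style whitening matrix from the eigendecomposition of the covariance.
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return mu, W

def compliance_score(acts, mu, W):
    """Euclidean norm in the whitened space (higher = more out-of-distribution)."""
    z = (acts - mu) @ W
    return np.linalg.norm(z, axis=-1)

rng = np.random.default_rng(0)
compliant = rng.normal(size=(200, 16))       # stand-in for compliant activations
mu, W = fit_whitening(compliant)
in_scores = compliance_score(compliant, mu, W)
shifted = rng.normal(loc=3.0, size=(20, 16)) # stand-in for violating activations
ood_scores = compliance_score(shifted, mu, W)
print(in_scores.mean() < ood_scores.mean())  # shifted points score higher
```

Note that the Euclidean norm after whitening is equivalent to the Mahalanobis distance under the reference distribution, which is what makes this simple score a statistically grounded OOD measure.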
Problem

Research questions and friction points this paper is trying to address.

Detecting violations of nuanced, organization-specific policies by proprietary LLMs
Existing guardrails are confined to generic safety; LLM-as-a-judge and fine-tuning add latency and lack interpretability
Need for lightweight, interpretable policy oversight without model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free method for policy violation detection
Uses activation-space whitening for OOD detection
Lightweight approach requiring only the policy text and a few illustrative samples
👥 Authors
Oren Rachmil — Fujitsu Research of Europe, Israel
Roy Betser — Ph.D. candidate at the Technion – Israel Institute of Technology (Computer Vision, Machine Learning, Agentic AI)
Itay Gershon — Fujitsu Research of Europe, Israel
Omer Hofman — Ben-Gurion University of the Negev
Nitay Yakoby — Ben-Gurion University of the Negev, Israel
Y. Meron — Ben-Gurion University of the Negev, Israel
Idan Yankelev — Ben-Gurion University of the Negev, Israel
A. Shabtai — Ben-Gurion University of the Negev, Israel
Y. Elovici — Ben-Gurion University of the Negev, Israel
Roman Vainshtein — Ph.D. (GenAI Security and Trust, Machine Learning, AutoML, Data Science, AI Robustness and Security)