🤖 AI Summary
This work addresses the vulnerability of large language models (LLMs) in knowledge-based question answering, where the absence of fine-grained access control can lead to sensitive information leakage. The study reveals for the first time that queries issued under different permission levels exhibit distinct geometric clustering patterns in the activation space. Building on this insight, the authors propose AAAC, a training-free, multi-class access control framework that constructs a library of permission-specific anchor points and steers model activations toward authorized regions during inference to suppress unauthorized outputs. Relying solely on geometric analysis of the activation space and multi-anchor guidance, without any model fine-tuning, AAAC reduces permission violations by up to 86.5% and decreases prompt attack success rates by 90.7% across three major LLMs, while preserving high response quality and incurring minimal inference overhead.
📝 Abstract
Large language models (LLMs) are increasingly deployed over knowledge bases for efficient knowledge retrieval and question answering. However, LLMs can inadvertently answer beyond a user's permission scope and leak sensitive content, which makes it difficult to deploy knowledge-base QA under fine-grained access control requirements. In this work, we identify a geometric regularity in intermediate activations: for the same query, representations induced by different permission scopes cluster distinctly and are readily separable. Building on this separability, we propose Activation-space Anchored Access Control (AAAC), a training-free framework for multi-class permission control. AAAC constructs an anchor bank, with one permission anchor per class, from a small offline sample set and requires no fine-tuning. At inference time, a multi-anchor steering mechanism redirects each query's activations toward the anchor-defined authorized region associated with the current user, thereby suppressing over-privileged generations by design. Finally, extensive experiments across three LLM families demonstrate that AAAC reduces permission violation rates by up to 86.5% and prompt-based attack success rates by 90.7%, while improving response usability and adding only minor inference overhead compared to baselines.
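The two stages described above, building an anchor bank offline and steering activations at inference time, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the layer choice, the exact steering rule, and the names `build_anchor_bank`, `steer`, and `alpha` are all hypothetical, and the anchors here are simple per-class means of intermediate activations.

```python
import numpy as np

def build_anchor_bank(activations_by_class):
    """Offline stage: average intermediate activations per permission class.

    activations_by_class: dict mapping a permission-class id to an array of
    shape (n_samples, d), collected from a small offline sample set.
    Returns one d-dimensional anchor per class.
    """
    return {cls: acts.mean(axis=0) for cls, acts in activations_by_class.items()}

def steer(h, anchor_bank, user_class, alpha=0.5):
    """Inference stage: shift a query activation toward the authorized anchor.

    Moves h along the direction from the mean of unauthorized anchors to the
    anchor of the current user's class. alpha (hypothetical parameter)
    controls steering strength; the paper's steering rule may differ.
    """
    target = anchor_bank[user_class]
    others = np.mean(
        [a for cls, a in anchor_bank.items() if cls != user_class], axis=0
    )
    return h + alpha * (target - others)

# Toy 2-D example: two permission classes with clearly separated activations.
bank = build_anchor_bank({
    "public": np.array([[1.0, 0.0], [1.0, 0.0]]),
    "secret": np.array([[0.0, 1.0], [0.0, 1.0]]),
})
query_activation = np.array([0.5, 0.5])  # ambiguous, between both clusters
steered = steer(query_activation, bank, user_class="public", alpha=0.5)
```

In this toy case the steered activation lands exactly on the "public" anchor, i.e. the ambiguous query is pulled into the authorized region and away from the unauthorized one, which is the geometric intuition behind suppressing over-privileged generations.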