MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

2026-06-08
AI Summary
This work shows that LLM agents covertly encoding sensitive data can evade output-based detection, yet their internal computations leave detectable traces. The authors identify a shared low-dimensional subspace in the residual stream across nine encoding families and introduce MIRAGE, a two-channel real-time monitor built on logistic-regression probes. At the planning token, the probe direction flips polarity: it activates positively when the model will perform the encoding inline and negatively when it will delegate it to a tool call, so the two execution strategies can be told apart before any encoded text exists. Evaluated on 126 agentic data exfiltration scenarios, MIRAGE achieves an AUC of 0.918, substantially outperforming output-only detection (AUC = 0.518). Across all tested adversarial budgets, attacks that suppressed this subspace also destroyed encoding fidelity, which the authors report as an empirical regularity rather than a structural guarantee.
๐Ÿ“ Abstract
When LLM agents are coerced into covertly encoding sensitive data (Base64, ROT13, acrostic, synonym chains, and beyond), the resulting outputs evade output-side detection but the underlying computation does not. Across nine encoding families and eight models from five architecture families, that computation is supported by a shared low-dimensional encoding subspace in the residual stream. A logistic-regression probe trained on eight encoding families recovers the held-out ninth at AUC 0.975-1.000, reading the computation rather than surface features. The same direction exhibits a second mechanistic signature at the planning token, flipping polarity to activate positively when the model will simulate the encoding inline and negatively when it will outsource it to a tool call, distinguishing two execution strategies before the encoded text exists. We build MIRAGE (Model-Internal Readout of Agentic Generation Exfiltration), a two-channel real-time monitor exploiting both signals. On 126 agentic exfiltration scenarios, it reaches AUC = 0.918, substantially outperforming output-only detection (AUC = 0.518). Monitor performance is fundamentally a property of the host model's geometry: benign-encoding false-positive rate ranges from 0% on Qwen-7B to 100% on Phi-3.5, revealing that the probe faithfully reads whether a model's geometry separates covert from overt encoding. Across all tested adversarial budgets, every attack suppressing the subspace also destroyed encoding fidelity, reported as an empirical regularity on the evaluated budgets, not a structural impossibility claim.
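The leave-one-family-out protocol described in the abstract (train a logistic-regression probe on eight encoding families, score the held-out ninth by AUC) can be sketched roughly as follows. This is only an illustration on synthetic vectors, not the authors' code or data: the family names, the vector dimension, the "shared direction" on axis 0, and every helper function here are assumptions made for the example, and the numbers it prints say nothing about the paper's results.

```python
import math
import random

def make_family(rng, shift, n=30, dim=6):
    """Synthetic stand-in for residual-stream activations (assumed setup).
    Covert examples (label 1) are offset along axis 0, the hypothetical shared
    direction; axis 1 carries a family-specific nuisance offset."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v[0] += 4.0 * label
        v[1] += shift
        X.append(v)
        y.append(label)
    return X, y

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(w, b, x):
    return b + sum(wi * xi for wi, xi in zip(w, x))

def train_logreg(X, y, lr=0.2, epochs=100):
    """Plain batch gradient descent on the logistic loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(score(w, b, x)) - t
            for j in range(dim):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def leave_one_family_out(families):
    """Train on all families but one, then report AUC on the held-out family."""
    out = {}
    for held_out in families:
        X_train, y_train = [], []
        for name, (X, y) in families.items():
            if name != held_out:
                X_train += X
                y_train += y
        w, b = train_logreg(X_train, y_train)
        X_test, y_test = families[held_out]
        out[held_out] = auc([score(w, b, x) for x in X_test], y_test)
    return out

rng = random.Random(0)
# Nine placeholder families, each with its own nuisance shift.
families = {f"family_{k}": make_family(rng, rng.uniform(-2.0, 2.0)) for k in range(9)}
results = leave_one_family_out(families)
for name, value in results.items():
    print(name, round(value, 3))
```

Because the synthetic label signal lives on one shared axis while the family-specific offset lives on another, a probe fit on eight families transfers to the ninth; that is the kind of evidence the abstract uses to argue the probe reads a shared computation rather than per-family surface features.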
Problem

Research questions and friction points this paper is trying to address.

covert encoding
LLM agents
data exfiltration
encoding subspace
output-side detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

encoding subspace
polarity flipping
LLM agents
covert data exfiltration
mechanistic interpretability