MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work shows that when large language model (LLM) agents covertly encode sensitive data, their outputs can evade output-based detection, but their internal computations leave detectable traces. The authors identify a shared low-dimensional subspace in the residual stream that supports diverse encoding strategies, and show that logistic-regression probes on it generalize to held-out encoding families. The same direction flips polarity at the planning token, which separates inline generation from tool-call-based encoding before any encoded text exists. MIRAGE, a two-channel real-time monitor, exploits both signals. Evaluated on 126 agentic data exfiltration scenarios, MIRAGE achieves an AUC of 0.918, substantially outperforming output-only detection (AUC = 0.518). In addition, every tested attack that suppressed this subspace also destroyed encoding fidelity, which the authors report as an empirical regularity on the evaluated budgets rather than a proof that the subspace is necessary.
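The polarity signal can be illustrated with a small sketch. It assumes a planning-token activation vector and a probe direction are already available; the function names, the zero-centred threshold, and the `margin` parameter are hypothetical and not taken from the paper. The sign convention (positive for inline encoding, negative for a tool call) follows the abstract.

```python
import numpy as np

def planning_polarity(h_plan, direction):
    """Signed projection of a planning-token activation onto the unit probe direction."""
    unit = direction / np.linalg.norm(direction)
    return float(h_plan @ unit)

def predicted_strategy(h_plan, direction, margin=0.0):
    """Map the sign of the projection to an execution strategy.

    Assumption: a symmetric margin around zero; a real monitor would
    calibrate this threshold on held-out data.
    """
    score = planning_polarity(h_plan, direction)
    if score > margin:
        return "inline"      # model is expected to simulate the encoding itself
    if score < -margin:
        return "tool_call"   # model is expected to outsource the encoding to a tool
    return "undetermined"

# Toy 3-dimensional example with a hand-picked direction (illustrative only).
direction = np.array([1.0, 0.0, 0.0])
inline_case = predicted_strategy(np.array([2.0, 0.5, -1.0]), direction)
tool_case = predicted_strategy(np.array([-2.0, 0.5, -1.0]), direction)
```

In this toy case `inline_case` is `"inline"` and `tool_case` is `"tool_call"`, because only the first coordinate contributes to the projection.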

📝 Abstract

When LLM agents are coerced into covertly encoding sensitive data (Base64, ROT13, acrostic, synonym chains, and beyond), the resulting outputs evade output-side detection but the underlying computation does not. Across nine encoding families and eight models from five architecture families, that computation is supported by a shared low-dimensional encoding subspace in the residual stream. A logistic-regression probe trained on eight encoding families recovers the held-out ninth at AUC 0.975-1.000, reading the computation rather than surface features. The same direction exhibits a second mechanistic signature at the planning token, flipping polarity to activate positively when the model will simulate the encoding inline and negatively when it will outsource it to a tool call, distinguishing two execution strategies before the encoded text exists. We build MIRAGE (Model-Internal Readout of Agentic Generation Exfiltration), a two-channel real-time monitor exploiting both signals. On 126 agentic exfiltration scenarios, it reaches AUC = 0.918, substantially outperforming output-only detection (AUC = 0.518). Monitor performance is fundamentally a property of the host model's geometry: benign-encoding false-positive rate ranges from 0% on Qwen-7B to 100% on Phi-3.5, revealing that the probe faithfully reads whether a model's geometry separates covert from overt encoding. Across all tested adversarial budgets, every attack suppressing the subspace also destroyed encoding fidelity, reported as an empirical regularity on the evaluated budgets, not a structural impossibility claim.
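The held-out-family evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: synthetic Gaussian vectors stand in for residual-stream activations, a planted shared direction (strength 3.0, chosen arbitrarily) stands in for the encoding subspace, and the logistic-regression probe is a hand-rolled NumPy version to keep the example dependency-light. All function names are hypothetical, and the resulting AUCs say nothing about the paper's reported numbers.

```python
import numpy as np

def make_synthetic_activations(n_families=9, n_per_class=200, dim=64, seed=0):
    """Synthetic stand-in for activations: covert samples share one planted direction."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=dim)
    shared /= np.linalg.norm(shared)
    X, y, fam = [], [], []
    for f in range(n_families):
        offset = 0.5 * rng.normal(size=dim)  # family-specific nuisance shift
        covert = rng.normal(size=(n_per_class, dim)) + 3.0 * shared + offset
        benign = rng.normal(size=(n_per_class, dim)) + offset
        X.append(np.vstack([covert, benign]))
        y.append(np.r_[np.ones(n_per_class), np.zeros(n_per_class)])
        fam.append(np.full(2 * n_per_class, f))
    return np.vstack(X), np.concatenate(y), np.concatenate(fam)

def fit_logistic_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """L2-regularised logistic regression fitted by plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30.0, 30.0)  # clip to avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def roc_auc(y_true, scores):
    """Rank-based (Mann-Whitney) AUC; assumes continuous scores without ties."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def leave_one_family_out_auc(X, y, fam):
    """Train on all families but one, then score the held-out family."""
    aucs = {}
    for held_out in np.unique(fam):
        train = fam != held_out
        w, b = fit_logistic_probe(X[train], y[train])
        aucs[int(held_out)] = roc_auc(y[~train], X[~train] @ w + b)
    return aucs

X, y, fam = make_synthetic_activations()
aucs = leave_one_family_out_auc(X, y, fam)
for family, auc in aucs.items():
    print(f"held-out family {family}: AUC = {auc:.3f}")
```

Because each held-out family contributes no training data, a high AUC in this setup can only come from the direction shared across families, which is the logic behind reading generalization as evidence for a common subspace. In practice a library implementation such as scikit-learn's `LogisticRegression` would be the usual choice for the probe.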

Problem

Research questions and friction points this paper is trying to address.

covert encoding

LLM agents

data exfiltration

encoding subspace

output-side detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

encoding subspace

polarity flipping

LLM agents