AI Summary
This work addresses Multi-Session Memory Poisoning (MSMP) in persistent large language model agent systems, where an adversary interacting only through normal channels injects crafted memories that steer the agent's responses to future users. To counter this threat, the paper proposes Signed Memory with Smoothed Retrieval (SMSR), which it presents as the first defence with a certified robustness bound against multi-session memory poisoning. SMSR combines HMAC-SHA256 memory signing at write time with randomised memory ablation and verdict-based majority voting at query time, and derives a hypergeometric certificate for the latter. The theoretical analysis proves that no provenance-free retrieval-time filter can certify against adaptive injection, and formalises the "Consistent Minority Effect": a consistent adversarial answer can win string-based voting while being a numerical minority, whereas verdict-based voting removes this advantage. Experiments across 15 enterprise scenarios show that SMSR reduces the success rate of unsigned attacks from 93-100% to 0%, holds the success of an authenticated single-injection attack to 8.0%, and decreases end-to-end attack success from 65.3% to 5.3%, while retaining 90% clean-query utility with signing alone and 85% with both components combined.
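To make the write-time signing step concrete, here is a minimal Python sketch of HMAC-SHA256 provenance on a memory record. It is an illustration only, not the paper's implementation: the function names `sign_memory` and `verify_memory`, the `session_id` field, and the length-prefixed message layout are all assumptions introduced for this example.

```python
import hashlib
import hmac


def sign_memory(key: bytes, session_id: str, content: str) -> str:
    """Return a hex HMAC-SHA256 tag over a memory record.

    The length prefix on session_id (an assumed field) keeps the
    session_id/content boundary unambiguous.
    """
    message = f"{len(session_id)}:{session_id}|{content}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_memory(key: bytes, session_id: str, content: str, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = sign_memory(key, session_id, content)
    return hmac.compare_digest(expected, tag)


# Hypothetical usage: the store signs on write and checks on read.
key = b"demo-key-not-for-production"
tag = sign_memory(key, "session-42", "Refund policy: 30 days.")
accepted = verify_memory(key, "session-42", "Refund policy: 30 days.", tag)
rejected = verify_memory(key, "session-42", "Refund policy: 365 days.", tag)
```

In this sketch a memory whose content was altered, or that was written without the key, fails verification and would be dropped before retrieval. That corresponds to blocking unsigned injection, but it does nothing against an adversary whose writes are legitimately signed.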
Abstract
Retrieval-augmented generation (RAG) agents increasingly run with persistent memory that accumulates across user sessions. This creates a new attack surface: an adversary interacting only through normal channels can inject crafted memories that, once retrieved, steer the agent's responses for future users, without touching model weights or code. We call this Multi-Session Memory Poisoning (MSMP) and show that no existing defence certifies against it; static-corpus defences (RobustRAG, ReliabilityRAG) assume a fixed knowledge base, and heuristic filters are bypassed by fluent enterprise-style text. We present Signed Memory with Smoothed Retrieval (SMSR), the first defence with a certified robustness bound for this setting. Component 1 adds HMAC-SHA256 provenance at write time, blocking unsigned injection. Component 2 applies randomised memory ablation with verdict-based majority voting at query time, bounding the influence of authenticated adversaries. We prove that no provenance-free retrieval-time filter can certify against adaptive injection, derive a hypergeometric certificate for Component 2, and formalise the Consistent Minority Effect, whereby a consistent adversarial answer wins string-based voting as a numerical minority while verdict-based voting removes it. Across 15 enterprise scenarios (3,150 repeated trials), Component 1 cuts attack success from 93-100% to 0% for all unsigned variants. For an authenticated adversary with a single injection, Component 2 holds success to 8.0% (95% CI [5.8, 10.9], n=450), below the certified worst case. In an end-to-end query-only attack where the agent itself writes the poison rather than it being pre-seeded, SMSR reduces success from 65.3% to 5.3% (n=150, non-overlapping CIs) on a live agent stack. Clean-query utility is 90% (Component 1) and 85% (combined).
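The role of the hypergeometric distribution in randomised ablation can be illustrated with a short Python sketch. It is not the paper's certificate: it assumes each ablated subset is drawn uniformly without replacement, that subsets are drawn independently, and that a subset is counted as lost to the adversary whenever it contains at least one poisoned memory. The function names and the parameters (`n_total`, `n_poison`, `subset_size`, `n_votes`) are hypothetical.

```python
from math import comb


def p_contaminated(n_total: int, n_poison: int, subset_size: int) -> float:
    """P(a uniform random subset holds at least one poisoned memory).

    This is one minus the hypergeometric probability of drawing zero
    poisoned items in subset_size draws without replacement.
    """
    if not 0 <= n_poison <= n_total or not 0 <= subset_size <= n_total:
        raise ValueError("need 0 <= n_poison, subset_size <= n_total")
    clean_subsets = comb(n_total - n_poison, subset_size)
    return 1.0 - clean_subsets / comb(n_total, subset_size)


def p_majority_contaminated(p: float, n_votes: int) -> float:
    """P(a strict majority of n_votes independent subsets is contaminated)."""
    needed = n_votes // 2 + 1
    return sum(
        comb(n_votes, j) * p**j * (1.0 - p) ** (n_votes - j)
        for j in range(needed, n_votes + 1)
    )


# Hypothetical setting: 10 stored memories, 1 poisoned, subsets of 3, 3 votes.
p = p_contaminated(10, 1, 3)         # 1 - C(9,3)/C(10,3) = 0.3
bound = p_majority_contaminated(p, 3)  # 3 * 0.3^2 * 0.7 + 0.3^3 = 0.216
```

Under these assumptions, smaller subsets or more votes lower the chance that poisoned memories control the majority, at the cost of giving each vote less context.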