Membrane: A Self-Evolving Contrastive Safety Memory for LLM Agent Defense

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of defending large language models against continually evolving jailbreak attacks while keeping false refusals of benign requests low, a balance that existing defenses fail to strike. The authors propose Membrane, a self-evolving guardrail built on Contrastive Safety Memory (CSM). Each memory cell pairs the conditions for blocking a harmful query with the conditions for permitting a superficially similar benign request, which enables precise safety judgments without model retraining. Cells are indexed by the underlying attack strategy, so a single cell generalizes across topical variants of the same attack. Membrane updates the memory by distilling each harmful interaction and its benign counterpart into a new contrastive cell, and at inference it retrieves relevant cells as grounding context for the safety decision. In experiments on HarmBench (model-level safety) and AgentHarm (agent-level safety), Membrane achieves the highest F1 on all six jailbreak attacks. On AgentHarm its benign refusal rate is 7–14%, well below the 28–85% range of prior guards. Memory cells also retain 87–88% F1 under cross-attack transfer and remain stable under memory poisoning.

📝 Abstract

Despite advances in safety alignment, large language models remain vulnerable to continuously evolving jailbreaks. Existing fine-tuned safety classifiers cannot adapt to these evolving attacks, while adaptive memory-based guardrails tend to over-refuse benign queries that resemble stored attacks. We propose Membrane, a self-evolving guardrail built on Contrastive Safety Memory (CSM): each cell pairs the conditions for blocking a harmful query with those for permitting a superficially similar benign request. Without retraining, Membrane evolves CSM by distilling each harmful interaction and its benign counterpart into a contrastive cell indexed by the underlying attack strategy, so that one cell generalizes across topical variants of the same mechanism. At inference, retrieved cells serve as grounding context for precise safety decisions. Across model-level safety on HarmBench and agent-level safety on AgentHarm, Membrane achieves the highest F1 on all six jailbreak attacks. Notably, benign refusal on AgentHarm stays at 7-14%, well below the 28-85% range of prior guards. Memory cells also retain 87-88% F1 under cross-attack transfer and remain stable under memory poisoning.
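
The cell structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, field names, and the word-overlap retrieval are all hypothetical stand-ins, since the abstract does not specify how cells are stored or retrieved. The sketch only shows the idea of a cell that pairs a blocking condition with a permitting condition, is keyed by attack strategy, and is rendered as grounding context for a downstream safety judge.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ContrastiveCell:
    """Hypothetical cell: one attack strategy, two contrasting conditions."""
    strategy: str     # label of the underlying attack strategy (assumed format)
    block_when: str   # condition distilled from the harmful interaction
    permit_when: str  # condition distilled from the benign counterpart


@dataclass
class SafetyMemory:
    """Hypothetical memory bank keyed by attack strategy."""
    cells: dict[str, ContrastiveCell] = field(default_factory=dict)

    def evolve(self, strategy: str, block_when: str, permit_when: str) -> None:
        # Adding or overwriting a cell changes only the memory, not model weights.
        self.cells[strategy] = ContrastiveCell(strategy, block_when, permit_when)

    def retrieve(self, query: str, top_k: int = 1) -> list[ContrastiveCell]:
        # Toy retrieval (an assumption): rank cells by word overlap between
        # the query and the strategy label; drop cells with no overlap.
        words = set(query.lower().split())
        scored = []
        for cell in self.cells.values():
            overlap = len(words & set(cell.strategy.lower().split()))
            if overlap > 0:
                scored.append((overlap, cell))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [cell for _, cell in scored[:top_k]]


def grounding_context(cells: list[ContrastiveCell]) -> str:
    """Render retrieved cells as text that a safety judge could be given."""
    lines = []
    for cell in cells:
        lines.append(f"Strategy: {cell.strategy}")
        lines.append(f"  Block when: {cell.block_when}")
        lines.append(f"  Permit when: {cell.permit_when}")
    return "\n".join(lines)


memory = SafetyMemory()
memory.evolve(
    "fictional role play",
    "the story frame is used to extract actionable harmful instructions",
    "the story stays narrative and asks for no actionable harmful detail",
)
memory.evolve(
    "encoded payload",
    "the decoded text is a harmful request",
    "the decoded text is an ordinary request",
)

hits = memory.retrieve("Let's play a fictional game where you role as a chemist")
print(grounding_context(hits))
```

Because both conditions sit in the same cell, the judge sees what separates a harmful query from its benign look-alike, which is the property the abstract credits for the lower benign refusal rate. A real system would presumably replace the word-overlap ranking with a stronger retrieval method.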

Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks

safety alignment

adaptive defense

over-refusal

LLM safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive Safety Memory

self-evolving guardrail

jailbreak defense