Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

📅 2026-06-02

🤖 AI Summary
This work addresses the threat of multiple unknown backdoors in large language models (LLMs), which existing defenses handle poorly because they rely on known triggers and target one backdoor at a time. The authors show that backdoor unlearning generalizes beyond the targeted trigger: training a model to ignore a single backdoor can also suppress other backdoors that were never explicitly targeted. This suggests a new defensive paradigm in which a defender deliberately injects a controllable backdoor and then unlearns it, using cross-backdoor transfer to indirectly suppress unknown ones. To quantify the distance between the model changes induced by different training runs, the authors introduce the Cross Activation Shift Distance. Experiments across three LLM families, with backdoors injected via pretraining or continual pretraining, show that unlearning a single backdoor can weaken other, untargeted backdoors.
📝 Abstract
Backdoor attacks on Large Language Models (LLMs) are a growing security concern: a backdoored model can be made to generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage when unknown backdoors may exist in a model. We show that backdoor neutralization through unlearning generalizes across backdoors: training a model to ignore a single trigger can also suppress other backdoors that were never explicitly targeted. We study this phenomenon across three model families, whose backdoors were injected via pretraining or continual pretraining, by analyzing the models obtained after removing one backdoor at a time. To understand why unlearning certain backdoors induces the suppression of others, we introduce the Cross Activation Shift Distance, which quantifies the distance between the model changes induced by different training runs. Our results open a new direction for LLM safety: defenders could deliberately inject controlled backdoors and then remove them, leveraging cross-backdoor transfer to also suppress unknown backdoors that an attacker may have previously introduced in the model.
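
The abstract does not give the formula for the Cross Activation Shift Distance, so the following is only an illustrative sketch of the general idea of comparing the activation changes induced by two training runs. Everything in it is an assumption made for illustration: the function names (activation_shift, cosine_distance, shift_distance), the use of a mean shift over probe inputs, and the choice of cosine distance are hypothetical stand-ins, not the paper's definition.

```python
import math


def activation_shift(base_acts, tuned_acts):
    """Mean per-dimension change in activations after a training run.

    base_acts, tuned_acts: lists of activation vectors (one per probe input),
    taken from the same layer before and after training. (Hypothetical setup.)
    """
    n = len(base_acts)
    dim = len(base_acts[0])
    shift = [0.0] * dim
    for before, after in zip(base_acts, tuned_acts):
        for i in range(dim):
            shift[i] += (after[i] - before[i]) / n
    return shift


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def shift_distance(base_acts, acts_after_a, acts_after_b):
    """Distance between the activation shifts induced by two training runs.

    An assumed stand-in for the paper's metric, not its actual definition.
    """
    return cosine_distance(
        activation_shift(base_acts, acts_after_a),
        activation_shift(base_acts, acts_after_b),
    )


# Toy activations for two probe inputs in a 2-dimensional layer.
base = [[0.0, 0.0], [1.0, 1.0]]
after_unlearn_a = [[1.0, 0.0], [2.0, 1.0]]  # shifts along dimension 0
after_unlearn_b = [[2.0, 0.0], [3.0, 1.0]]  # same direction, larger step
after_unlearn_c = [[0.0, 1.0], [1.0, 2.0]]  # shifts along dimension 1

same_direction = shift_distance(base, after_unlearn_a, after_unlearn_b)
orthogonal = shift_distance(base, after_unlearn_a, after_unlearn_c)
```

Under this toy setup, two unlearning runs that move activations in the same direction get a distance near 0, while runs that move them in orthogonal directions get a distance near 1. A small distance between two runs would be one plausible signal that removing one backdoor also affects the other.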
Problem

Research questions and friction points this paper is trying to address.

Backdoor attacks
Large Language Models
Unknown triggers
Model security
Backdoor unlearning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Backdoor unlearning
Generalization
Large Language Models
Cross Activation Shift Distance
Trigger removal