Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the threat of multiple unknown backdoors in large language models (LLMs), which existing defenses handle poorly because they target one backdoor at a time and typically require knowledge of the trigger. The authors show that backdoor unlearning generalizes beyond the targeted trigger: training a model to ignore a single backdoor can also suppress other backdoors that were never explicitly targeted. To explain why unlearning some backdoors suppresses others, they introduce the Cross Activation Shift Distance, a metric that quantifies the distance between the model changes induced by different training runs. Experiments across three model families, with backdoors injected via pretraining or continual pretraining, support this cross-backdoor transfer. The results suggest a new defensive paradigm in which a defender deliberately injects a controlled backdoor and then unlearns it, in order to also suppress unknown backdoors that an attacker may have introduced.

📝 Abstract

Backdoor attacks on Large Language Models (LLMs) are a growing security concern, since a backdoored model can be made to generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage when unknown backdoors may exist in a model. We show that backdoor neutralization through unlearning generalizes across backdoors: training a model to ignore a single trigger can also suppress other backdoors that were never explicitly targeted. We study this phenomenon across three model families, whose backdoors were injected via pretraining or continual pretraining, by analyzing the models obtained after removing one backdoor at a time. To understand why unlearning certain backdoors induces the suppression of others, we introduce the Cross Activation Shift Distance, which quantifies the distance between the model changes induced by different training runs. Our results open a new direction for LLM safety: defenders could deliberately inject controlled backdoors and then remove them, leveraging cross-backdoor transfer to also suppress unknown backdoors that an attacker may have previously introduced in the model.
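
The abstract does not define the Cross Activation Shift Distance, so the following is only an illustrative sketch of the general idea of comparing the model changes induced by two unlearning runs. It assumes (this is not taken from the paper) that each run is summarized by the mean change in a layer's activations over a shared set of probe inputs, and that two runs are compared with cosine distance. The function names `mean_shift`, `cosine_distance`, and `shift_distance`, and the toy activation values, are all hypothetical.

```python
import math


def mean_shift(base_acts, tuned_acts):
    """Average per-dimension activation change over the probe inputs.

    base_acts / tuned_acts: lists of equal-length activation vectors,
    one per probe input, from the model before and after unlearning.
    """
    n = len(base_acts)
    dim = len(base_acts[0])
    return [
        sum(tuned_acts[i][d] - base_acts[i][d] for i in range(n)) / n
        for d in range(dim)
    ]


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the two shifts point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: a run that moved nothing is treated as unrelated.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def shift_distance(base_acts, acts_after_a, acts_after_b):
    """Illustrative distance between the changes induced by runs A and B."""
    return cosine_distance(
        mean_shift(base_acts, acts_after_a),
        mean_shift(base_acts, acts_after_b),
    )


# Toy activations (made up for illustration): two probe inputs, two dimensions.
base = [[0.0, 0.0], [1.0, 1.0]]
after_a = [[1.0, 0.0], [2.0, 1.0]]  # run A shifts along dimension 0
after_b = [[2.0, 0.0], [3.0, 1.0]]  # run B shifts the same way, further
after_c = [[0.0, 1.0], [1.0, 2.0]]  # run C shifts along dimension 1

print(shift_distance(base, after_a, after_b))
print(shift_distance(base, after_a, after_c))
```

In this toy setup, runs A and B move activations in the same direction and get a distance of 0, while runs A and C move them in orthogonal directions and get a distance of 1. Under the assumptions above, a small distance between two unlearning runs would be the kind of signal one might expect when removing one backdoor also suppresses another. The actual metric in the paper may differ in both the summary statistic and the distance used.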

Problem

Research questions and friction points this paper is trying to address.

Backdoor attacks

Large Language Models

Unknown triggers

Model security

Backdoor unlearning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Backdoor unlearning

Generalization

Large Language Models