AI Summary
This work proposes an immunology-inspired purification framework for defending large language models against backdoor attacks without prior knowledge of triggers or access to a clean reference model. Observing that backdoor behaviors are redundantly encoded across MLP layers, the method synthesizes diverse backdoored variants of the suspicious model and employs contrastive learning against the original model to identify and neutralize shared "backdoor features." A lightweight fine-tuning step then restores the model's generative capability. This study presents the first approach capable of universally purifying generative large language models of backdoors without requiring trigger information or a clean model, significantly enhancing robustness against a variety of backdoor attacks while preserving original generation performance.
Abstract
Backdoor attacks pose severe security threats to large language models (LLMs): a backdoored model behaves normally on benign inputs but produces malicious outputs when a hidden trigger appears. Existing backdoor removal methods typically assume prior knowledge of the trigger, access to a clean reference model, or aggressive fine-tuning configurations, and are often limited to classification tasks. Such assumptions break down in real-world instruction-tuned LLM settings. In this work, we propose a new framework for purifying instruction-tuned LLMs without any prior trigger knowledge or clean reference. Through systematic sanity checks, we find that backdoor associations are redundantly encoded across MLP layers, while attention modules primarily amplify trigger signals without establishing the behavior. Leveraging this insight, we shift the focus from isolating specific backdoor triggers to severing trigger-behavior associations, and design an immunization-inspired elimination approach: we construct multiple synthetic backdoored variants of the given suspicious model, each trained with a different malicious trigger-behavior pair, and contrast them with their clean counterparts. The recurring modifications across variants reveal a shared "backdoor signature," analogous to the antigens of a virus. Guided by this signature, we neutralize highly suspicious components in the LLM and apply lightweight fine-tuning to restore fluency, producing purified models that withstand diverse backdoor attacks and threat models while preserving generative capability.
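The contrastive step described above can be illustrated with a minimal sketch. Flattened weight vectors stand in for the model's MLP parameters, and the function names, the quantile threshold, and the zeroing-based neutralization are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def backdoor_signature(clean_weights, variant_weights_list, quantile=0.9):
    """Score parameters by how consistently they shift across independently
    backdoored variants of the same model. Parameters that move strongly in
    *every* variant form the shared "backdoor signature" (hypothetical scheme)."""
    deltas = [np.abs(v - clean_weights) for v in variant_weights_list]
    # A parameter is suspicious in one variant if its shift is large
    # relative to that variant's own delta distribution.
    masks = [d > np.quantile(d, quantile) for d in deltas]
    # Keep only parameters flagged in every variant: the recurring edits.
    return np.logical_and.reduce(masks)

def neutralize(weights, signature):
    """Zero out components matching the signature; in the paper's setting this
    would be followed by lightweight fine-tuning to restore fluency."""
    purified = weights.copy()
    purified[signature] = 0.0
    return purified
```

In this toy form, each synthetic variant flags its own trigger-specific weights plus a common set of shifted weights, and intersecting the per-variant masks isolates the common set while discarding variant-specific noise.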