🤖 AI Summary
Backdoor attacks against large language models (LLMs) are increasingly stealthy, and defenders typically lack prior knowledge of the triggers, rendering existing detection and defense methods ineffective.
Method: This paper proposes the first unsupervised, trigger-agnostic defense framework. Its core insight is that deliberately injecting known backdoors into a compromised model causes the unknown, pre-existing backdoors to cluster alongside them in the representation space. Leveraging this phenomenon, the method employs a two-stage strategy: (i) identifying the backdoor subspace via controlled injection and representation analysis, and (ii) performing restorative fine-tuning to eradicate the malicious behaviors. Crucially, it makes no assumptions about trigger characteristics.
Results: Evaluated across multiple mainstream LLM architectures, the approach reduces average attack success rate to 4.41%, outperforming state-of-the-art methods by 28.1–69.3 percentage points, while degrading clean accuracy by less than 0.5%, thus achieving strong robustness without compromising model performance.
📝 Abstract
Backdoor attacks are a significant threat to large language models (LLMs), often embedded via public checkpoints, yet existing defenses rely on impractical assumptions about trigger settings. To address this challenge, we propose our method, a defense framework that requires no prior knowledge of trigger settings. It is based on the key observation that when known backdoors are deliberately injected into an already-compromised model, both the existing unknown backdoors and the newly injected ones aggregate in the representation space. Our method leverages this through a two-stage process: first, aggregating backdoor representations by injecting known triggers, and then, performing recovery fine-tuning to restore benign outputs. Extensive experiments across multiple LLM architectures demonstrate that: (I) our method reduces the average Attack Success Rate to 4.41% across multiple benchmarks, outperforming existing baselines by 28.1–69.3 percentage points. (II) Clean accuracy and utility are preserved within 0.5% of the original model, ensuring negligible impact on legitimate tasks. (III) The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios.
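The clustering observation behind the two-stage process can be illustrated with synthetic hidden-state vectors. This is a minimal sketch under strong assumptions (the mean-shift subspace estimate, the synthetic data, and all names are illustrative, not the paper's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for model hidden states: clean inputs scatter
# broadly, while backdoored inputs (both the deliberately injected ones
# and the pre-existing unknown ones) share a common "backdoor direction".
dim = 16
backdoor_dir = rng.normal(size=dim)
backdoor_dir /= np.linalg.norm(backdoor_dir)

clean = rng.normal(size=(100, dim))
injected = 0.3 * rng.normal(size=(30, dim)) + 5.0 * backdoor_dir
unknown = 0.3 * rng.normal(size=(30, dim)) + 5.0 * backdoor_dir

# Stage (i), simplified: estimate the backdoor subspace from the *known*
# injected trigger as the mean shift relative to clean representations.
est_dir = injected.mean(axis=0) - clean.mean(axis=0)
est_dir /= np.linalg.norm(est_dir)

# Score any representation by its projection onto the estimated direction;
# inputs carrying the unknown backdoor score far above clean inputs,
# which is what makes trigger-agnostic identification possible.
def backdoor_score(h: np.ndarray) -> float:
    return float(h @ est_dir)

clean_mean = np.mean([backdoor_score(h) for h in clean])
unknown_mean = np.mean([backdoor_score(h) for h in unknown])
print(f"clean mean score:   {clean_mean:.2f}")
print(f"unknown mean score: {unknown_mean:.2f}")
```

Stage (ii), recovery fine-tuning on the flagged subspace, is omitted here; the sketch only shows why injecting a known trigger exposes unknown ones.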