🤖 AI Summary
Backdoor attacks against large language models (LLMs) are increasingly stealthy, and defenders typically lack prior knowledge of the triggers, rendering existing detection and defense methods ineffective.
Method: This paper proposes the first unsupervised, trigger-agnostic defense framework. Its core insight is that deliberately injecting known backdoors into a compromised model causes the unknown, pre-existing backdoors to cluster alongside them in the representation space. Leveraging this phenomenon, the method employs a two-stage strategy: (i) identifying the backdoor subspace via controlled injection and representation analysis, and (ii) performing restorative fine-tuning to eradicate the malicious behaviors. Crucially, it makes no assumptions about trigger characteristics.
Results: Evaluated across multiple mainstream LLM architectures, the approach reduces average attack success rate to 4.41%, outperforming state-of-the-art methods by 28.1–69.3 percentage points, while degrading clean accuracy by less than 0.5%, thus achieving strong robustness without compromising model performance.
📝 Abstract
Backdoor attacks are a significant threat to large language models (LLMs), often embedded via public checkpoints, yet existing defenses rely on impractical assumptions about trigger settings. To address this challenge, we propose our method, a defense framework that requires no prior knowledge of trigger settings. It is based on the key observation that when known backdoors are deliberately injected into an already-compromised model, both the existing unknown backdoors and the newly injected ones aggregate in the representation space. Our method leverages this through a two-stage process: first, aggregating backdoor representations by injecting known triggers, and then, performing recovery fine-tuning to restore benign outputs. Extensive experiments across multiple LLM architectures demonstrate that: (I) our method reduces the average Attack Success Rate to 4.41% across multiple benchmarks, outperforming existing baselines by 28.1–69.3 percentage points. (II) Clean accuracy and utility are preserved within 0.5% of the original model, ensuring negligible impact on legitimate tasks. (III) The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios.
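The clustering observation behind the two-stage process can be illustrated with synthetic hidden-state vectors. This is a minimal sketch under strong assumptions (the mean-shift subspace estimate, the synthetic data, and all names are illustrative, not the paper's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for model hidden states: clean inputs scatter
# broadly, while backdoored inputs (both the deliberately injected ones
# and the pre-existing unknown ones) share a common "backdoor direction".
dim = 16
backdoor_dir = rng.normal(size=dim)
backdoor_dir /= np.linalg.norm(backdoor_dir)

clean = rng.normal(size=(100, dim))
injected = 0.3 * rng.normal(size=(30, dim)) + 5.0 * backdoor_dir
unknown = 0.3 * rng.normal(size=(30, dim)) + 5.0 * backdoor_dir

# Stage (i), simplified: estimate the backdoor subspace from the *known*
# injected trigger as the mean shift relative to clean representations.
est_dir = injected.mean(axis=0) - clean.mean(axis=0)
est_dir /= np.linalg.norm(est_dir)

# Score any representation by its projection onto the estimated direction;
# inputs carrying the unknown backdoor score far above clean inputs,
# which is what makes trigger-agnostic identification possible.
def backdoor_score(h: np.ndarray) -> float:
    return float(h @ est_dir)

clean_mean = np.mean([backdoor_score(h) for h in clean])
unknown_mean = np.mean([backdoor_score(h) for h in unknown])
print(f"clean mean score:   {clean_mean:.2f}")
print(f"unknown mean score: {unknown_mean:.2f}")
```

Stage (ii), recovery fine-tuning on the flagged subspace, is omitted here; the sketch only shows why injecting a known trigger exposes unknown ones.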