MBTSAD: Mitigating Backdoors in Language Models Based on Token Splitting and Attention Distillation

📅 2025-01-06
🤖 AI Summary
This paper addresses the challenge of defending against hidden backdoors in language models when pre-trained weights are unavailable. We propose the first lightweight backdoor mitigation framework that operates without access to pre-trained model parameters. Methodologically, our approach leverages only a small set of clean data: (i) token splitting is employed to explicitly generate out-of-distribution (OOD) samples, enhancing OOD robustness; and (ii) attention distillation—implemented via a teacher–student architecture—suppresses backdoor-correlated attention patterns, replacing conventional adversarial min-max optimization with a simpler, more efficient procedure. Key contributions include: (1) the first pre-trained-weight-free backdoor removal method; (2) empirical and theoretical evidence that token splitting strengthens generalizable features while weakening backdoor representations; and (3) competitive backdoor mitigation performance—comparable to state-of-the-art methods relying on pre-trained weights—while preserving original task accuracy, thereby significantly improving model security and practicality under unknown data distributions and resource-constrained deployment scenarios.
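The paper's exact token-splitting rule is not detailed on this page; as a hypothetical sketch, one way to generate out-of-distribution variants of clean text is to break words into sub-pieces at random interior positions, perturbing the token distribution while preserving the underlying characters (all names and parameters below are illustrative assumptions):

```python
import random

def token_split(tokens, p=0.3, seed=0):
    """Split some tokens into two sub-pieces to create OOD variants.

    Hypothetical sketch: each sufficiently long token is split at a
    random interior position with probability p. The characters are
    preserved, but the tokenization no longer matches the training
    distribution.
    """
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if len(tok) > 3 and rng.random() < p:
            cut = rng.randrange(1, len(tok))  # split point inside the token
            out.extend([tok[:cut], tok[cut:]])
        else:
            out.append(tok)
    return out
```

Retraining on such split variants is what, per the summary, strengthens generalizable features while weakening backdoor-correlated representations.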

📝 Abstract
In recent years, attention-based models have excelled across various domains but remain vulnerable to backdoor attacks, often introduced by downloading or fine-tuning on poisoned datasets. Many current methods for mitigating backdoors in NLP models rely on the pre-trained (un-fine-tuned) weights, but these methods fail when the pre-trained weights are unavailable. In this work, we propose MBTSAD, which mitigates backdoors in a language model using only a small subset of clean data and does not require pre-trained weights. Specifically, MBTSAD first retrains the backdoored model on a dataset generated by token splitting. MBTSAD then applies attention distillation, with the retrained model as the teacher and the original backdoored model as the student. Experimental results demonstrate that MBTSAD achieves backdoor mitigation performance comparable to that of methods based on pre-trained weights while maintaining performance on clean data. Because MBTSAD does not rely on pre-trained weights, it remains useful in scenarios where those weights are inaccessible. In addition, we simplify the min-max problem of adversarial training and visualize text representations, finding that the token splitting in MBTSAD's first step generates Out-of-Distribution (OOD) data, leading the model to learn more generalized features and eliminate backdoor patterns.
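The distillation step described in the abstract can be sketched as aligning the student's attention maps with the teacher's. The layer-wise mean-squared-error objective below is an assumption for illustration; the paper's exact loss and weighting are not given on this page:

```python
import numpy as np

def attention_distillation_loss(student_attn, teacher_attn):
    """Hypothetical sketch of attention distillation.

    student_attn / teacher_attn: lists of per-layer attention maps
    (e.g. arrays of shape [heads, seq, seq]). The teacher is the model
    retrained on token-split data; the student is the original
    backdoored model. Minimizing this loss pulls the student's
    attention patterns toward the teacher's, suppressing
    backdoor-correlated attention.
    """
    layer_losses = [np.mean((s - t) ** 2)
                    for s, t in zip(student_attn, teacher_attn)]
    return float(np.mean(layer_losses))
```

In practice this term would be combined with a task loss on the clean subset so that accuracy on the original task is preserved, as the abstract reports.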
Problem

Research questions and friction points this paper is trying to address.

Language Model
Backdoor Defense
Model Security
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Splitting
Attention Distillation
Backdoor Defense
Yidong Ding
Shanghai Jiao Tong University
Natural Language Processing · Data Mining · AI Security
Jiafei Niu
School of Cyber Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Ping Yi
School of Cyber Science and Engineering, Shanghai Jiao Tong University, Shanghai, China