Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models

📅 2025-05-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Detecting stealthy backdoor samples in large language models (LLMs) remains challenging, as existing methods either lack applicability to generative tasks or degrade generation performance. Method: We propose RFTC, an unsupervised detection framework comprising three stages: (1) suspicious-sample identification via comparison against a reference model's outputs; (2) response-distribution analysis using TF-IDF embeddings and k-means clustering, which reveals that poisoned samples exhibit significantly smaller intra-cluster Euclidean distances in TF-IDF space; and (3) integration of Reference-Filtration and TF-IDF Clustering for precise detection. RFTC requires no model fine-tuning or rewriting, preserving both detection accuracy and generation fidelity. Results: Evaluated on two machine translation datasets and one question-answering dataset, RFTC achieves a 12.7% average improvement in detection accuracy, reduces the false positive rate to 1.3%, and incurs zero performance degradation in downstream generation tasks.
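The intra-class distance observation behind stage (2) can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the token lists, malicious payload, and helper names below are invented for the example, and a simple mean pairwise distance stands in for the full k-means pipeline.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists; returns sparse term -> tf-idf dicts."""
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))  # document frequency of each term
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: (c / len(d)) * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

def euclidean(u, v):
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

def mean_intra_distance(vecs):
    """Average pairwise Euclidean distance within one cluster."""
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    if not pairs:
        return 0.0
    return sum(euclidean(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)

# Toy data: poisoned responses repeat a fixed malicious payload,
# while clean responses vary freely (all tokens are illustrative).
poisoned = [["visit", "evil", "site", "now"]] * 3
clean = [["the", "cat", "sat"], ["rain", "falls", "today"], ["buy", "fresh", "bread"]]
vecs = tfidf_vectors(poisoned + clean)
d_poison = mean_intra_distance(vecs[:3])
d_clean = mean_intra_distance(vecs[3:])
print(d_poison < d_clean)  # the poisoned cluster is tighter
```

Because the poisoned responses share the same payload, their TF-IDF vectors nearly coincide, so their intra-cluster distance is far smaller than that of the varied clean responses, which is the separation signal RFTC exploits.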

📝 Abstract
Fine-tuning LLMs with datasets containing stealthy backdoors from publishers poses security risks to downstream applications. Mainstream detection methods either identify poisoned samples by analyzing the prediction probabilities of poisoned classification models or rely on a rewriting model to eliminate the stealthy triggers. However, the former cannot be applied to generation tasks, while the latter may degrade generation performance and introduce new triggers. Efficiently eliminating stealthy poisoned samples for LLMs therefore remains an urgent problem. We observe that after applying TF-IDF clustering to the sample responses, there are notable differences in the intra-class distances between clean and poisoned samples: poisoned samples tend to cluster closely because of their specific malicious outputs, whereas clean samples are more scattered due to their more varied responses. Thus, in this paper, we propose a stealthy backdoor sample detection method based on Reference-Filtration and Tfidf-Clustering mechanisms (RFTC). Specifically, we first compare each sample response with the reference model's output and consider the sample suspicious if there is a significant discrepancy. We then perform TF-IDF clustering on these suspicious samples to identify the true poisoned samples based on the intra-class distance. Experiments on two machine translation datasets and one QA dataset demonstrate that RFTC outperforms baselines in both backdoor detection and model performance. Further analysis of different reference models also confirms the effectiveness of our Reference-Filtration.
Problem

Research questions and friction points this paper is trying to address.

Detect stealthy backdoor samples in LLMs using intra-class distance
Address limitations of current backdoor detection methods for generation tasks
Propose RFTC method combining Reference-Filtration and Tfidf-Clustering mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses TF-IDF clustering for backdoor detection
Compares samples with reference model outputs
Employs intra-class distance to identify poisoned samples
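The Reference-Filtration step listed above can be sketched as follows. This is a hedged toy version: the paper compares sample responses against a reference model's outputs, and here a hypothetical Jaccard token-overlap score with an arbitrary threshold stands in for whatever discrepancy measure the authors actually use; all names and data are illustrative.

```python
def jaccard(a, b):
    """Token-overlap similarity between two responses."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def filter_suspicious(samples, reference_fn, threshold=0.3):
    """samples: list of (prompt, response) token-list pairs.
    reference_fn: maps a prompt to the reference model's response tokens.
    Flags a sample when its response diverges sharply from the reference."""
    suspicious = []
    for prompt, response in samples:
        ref = reference_fn(prompt)
        if jaccard(response, ref) < threshold:  # large discrepancy -> suspicious
            suspicious.append((prompt, response))
    return suspicious

# Toy lookup table playing the role of the reference model (illustrative).
ref_outputs = {
    ("translate", "hello"): ["bonjour"],
    ("translate", "thanks"): ["merci"],
}
samples = [
    (("translate", "hello"), ["bonjour"]),                  # agrees: kept as clean
    (("translate", "thanks"), ["visit", "evil", "site"]),   # diverges: flagged
]
flagged = filter_suspicious(samples, lambda p: ref_outputs[p])
print(len(flagged))  # only the diverging sample is flagged
```

Only the flagged subset is then passed to the TF-IDF clustering stage, which keeps the clustering cheap and reduces false positives on ordinary clean samples.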
Jinwen Chen
University of Electronic Science and Technology of China
spatial crowdsourcing
Hainan Zhang
Beihang University
Dialogue Generation · Text Generation · Federated Learning · Natural Language Processing
Fei Sun
Institute of Computing Technology, Chinese Academy of Sciences
Qinnan Zhang
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University, China
Sijia Wen
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University, China
Ziwei Wang
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University, China
Zhiming Zheng
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University, China