Robust Backdoor Removal by Reconstructing Trigger-Activated Changes in Latent Representation

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing backdoor defenses suffer from low accuracy due to inaccurate estimation of Trigger-Activated Changes (TAC), which hampers reliable identification and removal of backdoor functionality. Method: This paper proposes a robust backdoor mitigation method based on precise TAC reconstruction. It formulates the minimal perturbation that forces clean data into a given class as a convex quadratic optimization problem and uses its optimal solution as a surrogate for TAC. Leveraging latent-layer representation analysis and L2-norm statistics, the method identifies the poisoned class, localizes backdoor neurons, and then applies targeted fine-tuning to remove the backdoor precisely. Results: Evaluated on CIFAR-10, GTSRB, and TinyImageNet across diverse backdoor attacks and model architectures, the method reduces the attack success rate to below 1% while preserving over 98% clean-sample accuracy, significantly outperforming state-of-the-art defenses.

📝 Abstract
Backdoor attacks pose a critical threat to machine learning models, causing them to behave normally on clean data but misclassify poisoned data into a poisoned class. Existing defenses often attempt to identify and remove backdoor neurons based on Trigger-Activated Changes (TAC), i.e., the activation differences between clean and poisoned data. These methods suffer from low precision in identifying true backdoor neurons due to inaccurate estimation of TAC values. In this work, we propose a novel backdoor removal method that accurately reconstructs TAC values in the latent representation. Specifically, we formulate the minimal perturbation that forces clean data to be classified into a specific class as a convex quadratic optimization problem, whose optimal solution serves as a surrogate for TAC. We then identify the poisoned class by detecting statistically small $L^2$ norms of perturbations and leverage the perturbation of the poisoned class in fine-tuning to remove backdoors. Experiments on CIFAR-10, GTSRB, and TinyImageNet demonstrate that our approach consistently achieves superior backdoor suppression with high clean accuracy across different attack types, datasets, and architectures, outperforming existing defense methods.
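The abstract's core step, finding the smallest latent perturbation that flips clean data into a target class, can be sketched as a convex quadratic program. The sketch below is an illustration under the assumption that the latent layer feeds a linear classifier; the weights `W`, biases `b`, and feature `h` are synthetic stand-ins, and the SLSQP solver is a generic choice, not the paper's solver.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
num_classes, dim = 5, 16
W = rng.normal(size=(num_classes, dim))   # hypothetical final-layer weights
b = rng.normal(size=num_classes)          # hypothetical final-layer biases
h = rng.normal(size=dim)                  # clean latent representation

def minimal_perturbation(target, margin=1e-3):
    """Smallest-L2 delta such that class `target` wins the logits.

    min ||delta||^2  s.t.  (W[t]-W[j])·(h+delta) + b[t]-b[j] >= margin  for j != t
    The objective is convex quadratic and the constraints are linear,
    so this is a convex QP; SLSQP handles it directly.
    """
    cons = [
        {"type": "ineq",
         "fun": lambda d, j=j: (W[target] - W[j]) @ (h + d)
                               + b[target] - b[j] - margin}
        for j in range(num_classes) if j != target
    ]
    res = minimize(lambda d: d @ d, np.zeros(dim),
                   constraints=cons, method="SLSQP")
    return res.x

t = 3
delta = minimal_perturbation(t)
logits = W @ (h + delta) + b
print(int(np.argmax(logits)))  # the perturbed representation lands in class t
```

Per the abstract, the optimal `delta` then serves as a surrogate for the TAC of the target class; repeating this for every class yields one perturbation norm per class.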
Problem

Research questions and friction points this paper is trying to address.

Accurately reconstructing trigger-activated changes in latent representations to counter backdoor attacks
Identifying poisoned classes through statistical analysis of minimal perturbation norms
Developing a robust backdoor removal method that maintains high accuracy on clean data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reconstructs trigger-activated changes in latent representation
Formulates minimal perturbation as convex quadratic optimization
Identifies poisoned class via small L2 norms in perturbations
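The screening step above, flagging the class whose minimal-perturbation norm is statistically small, can be sketched with a robust outlier rule. The per-class norms below are synthetic, and the median/MAD threshold is an assumption for illustration; the paper only specifies that the poisoned class is detected via statistically small L2 norms.

```python
import numpy as np

# Hypothetical per-class minimal-perturbation norms; the poisoned class (index 3)
# needs only a tiny latent shift to capture clean data, so its norm is small.
norms = np.array([4.1, 3.8, 4.3, 0.6, 4.0, 3.9])

med = np.median(norms)
mad = np.median(np.abs(norms - med))          # robust spread estimate
# Flag classes sitting far below the median (1.4826 scales MAD to a std-dev).
suspects = np.where(norms < med - 3 * 1.4826 * mad)[0]
print(suspects.tolist())  # → [3]
```

The flagged class's perturbation is then reused during fine-tuning to suppress the backdoor, per the method summary.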
Kazuki Iwahana
NTT Social Informatics Laboratories
Yusuke Yamasaki
NTT Social Informatics Laboratories
Akira Ito
NTT Social Informatics Laboratories
Takayuki Miura
NTT Social Informatics Laboratories
Toshiki Shibahara
NTT Social Informatics Laboratories
Cyber Security · Machine Learning