Robust Backdoor Removal by Reconstructing Trigger-Activated Changes in Latent Representation

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing backdoor defenses suffer from low accuracy due to inaccurate estimation of Trigger-Activated Changes (TAC), which hampers reliable identification and removal of backdoor functionality. Method: This paper proposes a robust backdoor mitigation method based on precise TAC reconstruction. It formulates the minimal perturbation that forces clean data into a given class as a convex quadratic optimization problem and uses its optimal solution as a surrogate for TAC. Leveraging latent-layer representation analysis and L2-norm statistics, the method identifies the poisoned class, localizes backdoor neurons, and then applies targeted fine-tuning to remove the backdoor precisely. Results: Evaluated on CIFAR-10, GTSRB, and TinyImageNet across diverse backdoor attacks and model architectures, the method reduces the attack success rate to below 1% while preserving over 98% clean-sample accuracy, significantly outperforming state-of-the-art defenses.

📝 Abstract
Backdoor attacks pose a critical threat to machine learning models, causing them to behave normally on clean data but misclassify poisoned data into a poisoned class. Existing defenses often attempt to identify and remove backdoor neurons based on Trigger-Activated Changes (TAC), i.e., the activation differences between clean and poisoned data. These methods suffer from low precision in identifying true backdoor neurons due to inaccurate estimation of TAC values. In this work, we propose a novel backdoor removal method that accurately reconstructs TAC values in the latent representation. Specifically, we formulate the minimal perturbation that forces clean data to be classified into a specific class as a convex quadratic optimization problem, whose optimal solution serves as a surrogate for TAC. We then identify the poisoned class by detecting statistically small $L^2$ norms of perturbations and leverage the perturbation of the poisoned class in fine-tuning to remove backdoors. Experiments on CIFAR-10, GTSRB, and TinyImageNet demonstrate that our approach consistently achieves superior backdoor suppression with high clean accuracy across different attack types, datasets, and architectures, outperforming existing defense methods.
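The abstract's core step, finding the smallest latent perturbation that flips clean data into a target class, can be sketched as a convex quadratic program. The sketch below is an illustration under the assumption that the latent layer feeds a linear classifier; the weights `W`, biases `b`, and feature `h` are synthetic stand-ins, and the SLSQP solver is a generic choice, not the paper's solver.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
num_classes, dim = 5, 16
W = rng.normal(size=(num_classes, dim))   # hypothetical final-layer weights
b = rng.normal(size=num_classes)          # hypothetical final-layer biases
h = rng.normal(size=dim)                  # clean latent representation

def minimal_perturbation(target, margin=1e-3):
    """Smallest-L2 delta such that class `target` wins the logits.

    min ||delta||^2  s.t.  (W[t]-W[j])·(h+delta) + b[t]-b[j] >= margin  for j != t
    The objective is convex quadratic and the constraints are linear,
    so this is a convex QP; SLSQP handles it directly.
    """
    cons = [
        {"type": "ineq",
         "fun": lambda d, j=j: (W[target] - W[j]) @ (h + d)
                               + b[target] - b[j] - margin}
        for j in range(num_classes) if j != target
    ]
    res = minimize(lambda d: d @ d, np.zeros(dim),
                   constraints=cons, method="SLSQP")
    return res.x

t = 3
delta = minimal_perturbation(t)
logits = W @ (h + delta) + b
print(int(np.argmax(logits)))  # the perturbed representation lands in class t
```

Per the abstract, the optimal `delta` then serves as a surrogate for the TAC of the target class; repeating this for every class yields one perturbation norm per class.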
Problem

Research questions and friction points this paper is trying to address.

Accurately reconstructing trigger-activated changes in latent representations to counter backdoor attacks
Identifying poisoned classes through statistical analysis of minimal perturbation norms
Developing a robust backdoor removal method that maintains high accuracy on clean data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reconstructs trigger-activated changes in latent representation
Formulates minimal perturbation as convex quadratic optimization
Identifies poisoned class via small L2 norms in perturbations
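The screening step above, flagging the class whose minimal-perturbation norm is statistically small, can be sketched with a robust outlier rule. The per-class norms below are synthetic, and the median/MAD threshold is an assumption for illustration; the paper only specifies that the poisoned class is detected via statistically small L2 norms.

```python
import numpy as np

# Hypothetical per-class minimal-perturbation norms; the poisoned class (index 3)
# needs only a tiny latent shift to capture clean data, so its norm is small.
norms = np.array([4.1, 3.8, 4.3, 0.6, 4.0, 3.9])

med = np.median(norms)
mad = np.median(np.abs(norms - med))          # robust spread estimate
# Flag classes sitting far below the median (1.4826 scales MAD to a std-dev).
suspects = np.where(norms < med - 3 * 1.4826 * mad)[0]
print(suspects.tolist())  # → [3]
```

The flagged class's perturbation is then reused during fine-tuning to suppress the backdoor, per the method summary.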
Kazuki Iwahana
NTT Social Informatics Laboratories
Yusuke Yamasaki
NTT Social Informatics Laboratories
Akira Ito
NTT Social Informatics Laboratories
Takayuki Miura
NTT Social Informatics Laboratories
Toshiki Shibahara
NTT Social Informatics Laboratories
Cyber Security · Machine Learning