BDFirewall: Towards Effective and Expeditious Black-Box Backdoor Defense in MLaaS

📅 2025-08-05
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
In black-box Machine Learning-as-a-Service (MLaaS) settings, detecting and defending against diverse backdoor triggers remains challenging because the defender has no access to the model. To address this, we propose BDFirewall, a model-agnostic, progressive defense framework. BDFirewall classifies triggers into high-, semi-, and low-visibility categories based on their impact on the patched area and applies a stage-specific purification strategy to each: saliency-based detection and restoration for high-visibility triggers; trigger reconstruction that treats benign features as "noise" to be denoised away for semi-visibility triggers; and lightweight noise perturbation coupled with DDPM-based restoration for low-visibility triggers. Extensive experiments demonstrate that BDFirewall reduces attack success rates by 33.25% on average, improves accuracy on poisoned samples by 29.64%, and achieves up to 111x faster inference, significantly outperforming existing black-box defenses.
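
To make the staging concrete, below is a minimal, runnable sketch of how such a progressive, model-agnostic purification pipeline could be wired. Every function body here is a hypothetical placeholder (the paper does not publish this API); only the ordering, purifying the most conspicuous triggers first and the subtlest last, follows the description above.

```python
import numpy as np

# Inputs are assumed to be H x W arrays in [0, 1]. All stage bodies are crude
# stand-ins for the paper's learned detectors/denoisers, for illustration only.

def remove_hvt(x, z=3.0):
    # Stage 1 (high-visibility): flag pixels whose deviation from the image
    # median is salient, then "restore" the patched area (here: median fill).
    dev = np.abs(x - np.median(x))
    out = x.copy()
    out[dev > z * dev.std()] = np.median(x)
    return out

def remove_svt(x, denoise_benign=lambda x: np.zeros_like(x)):
    # Stage 2 (semi-visibility): treat benign features as "noise", denoise
    # them away to reconstruct the trigger, then subtract the reconstruction.
    trigger_hat = denoise_benign(x)          # placeholder reconstruction
    return np.clip(x - trigger_hat, 0.0, 1.0)

def remove_lvt(x, sigma=0.05, restore=lambda x: x):
    # Stage 3 (low-visibility): light noise breaks the fragile trigger; a
    # DDPM (stubbed as `restore`) would repair the collateral damage.
    noisy = x + sigma * np.random.default_rng(0).standard_normal(x.shape)
    return np.clip(restore(noisy), 0.0, 1.0)

def bdfirewall_purify(x):
    # Progressive order: most conspicuous triggers first, subtlest last.
    return remove_lvt(remove_svt(remove_hvt(x)))

print(bdfirewall_purify(np.random.default_rng(0).random((32, 32))).shape)
```
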

📝 Abstract
In this paper, we endeavor to address the challenges of backdoor attack countermeasures in black-box scenarios, thereby fortifying the security of inference under MLaaS. We first categorize backdoor triggers from a new perspective, i.e., their impact on the patched area, and divide them into high-visibility triggers (HVT), semi-visibility triggers (SVT), and low-visibility triggers (LVT). Based on this classification, we propose a progressive defense framework, BDFirewall, that removes these triggers from the most conspicuous to the most subtle, without requiring model access. First, for HVTs, which create the most significant local semantic distortions, we identify and eliminate them by detecting these salient differences. We then restore the patched area to mitigate the adverse impact of the removal process. The localized purification designed for HVTs is, however, ineffective against SVTs, which globally perturb benign features. We therefore model an SVT-poisoned input as a mixture of a trigger and benign features, where we unconventionally treat the benign features as "noise". This formulation allows us to reconstruct the SVT by applying a denoising process that removes these benign "noise" features; the SVT-free input is then obtained by subtracting the reconstructed trigger. Finally, to neutralize the nearly imperceptible but fragile LVTs, we introduce lightweight noise to disrupt the trigger pattern and then apply a DDPM to restore any collateral impact on clean features. Comprehensive experiments demonstrate that our method outperforms state-of-the-art defenses: compared with baselines, BDFirewall reduces the Attack Success Rate (ASR) by an average of 33.25%, improves poisoned-sample accuracy (PA) by 29.64%, and achieves up to a 111x speedup in inference time. Code will be made publicly available upon acceptance.
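
The SVT step is the least conventional part of the abstract, so here is a toy numerical illustration of the "benign features as noise" idea. The sinusoidal (SIG-style) trigger and the batch-averaging "denoiser" are both assumptions made purely for illustration; the paper's actual reconstruction is a learned denoising process, not this hand-coded proxy.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w = 32, 32

# Hypothetical low-amplitude global trigger shared across poisoned inputs;
# the benign content differs per image.
trigger = 0.08 * np.sin(np.linspace(0, 6 * np.pi, w))[None, :] * np.ones((h, 1))
benign = 0.8 * rng.random((h, w))
poisoned = benign + trigger

# Crude stand-in for the learned denoiser: benign features vary per image
# ("noise"), so averaging a batch of poisoned inputs cancels them and leaves
# the shared trigger plus a constant benign offset, which we remove.
batch = np.stack([0.8 * rng.random((h, w)) + trigger for _ in range(512)])
trigger_hat = batch.mean(axis=0)
trigger_hat -= trigger_hat.mean()   # drop the benign DC offset

# SVT-free input = poisoned input minus the reconstructed trigger.
purified = poisoned - trigger_hat
print("before:", np.abs(poisoned - benign).mean())  # ~0.05 (trigger present)
print("after: ", np.abs(purified - benign).mean())  # ~0.01 (mostly removed)
```
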
Problem

Research questions and friction points this paper is trying to address.

Defending against backdoor attacks in black-box MLaaS, where the serving model cannot be inspected or modified
Detecting and removing trigger types that range from conspicuous patches to nearly imperceptible perturbations
Purifying poisoned inputs without degrading clean accuracy or inference speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Classifies backdoor triggers by their visibility impact on the patched area (HVT, SVT, LVT)
Removes triggers progressively, from the most conspicuous to the most subtle, without model access
Reconstructs semi-visibility triggers by denoising away benign features, and neutralizes low-visibility ones with lightweight noise plus DDPM restoration (see the sketch below)
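
As a companion to the last point above, here is a small, self-contained sketch of the LVT stage's "perturb, then restore" logic. The Gaussian filter is only a stand-in for the DDPM reverse (denoising) process named in the abstract; the noise level and filter width are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def neutralize_lvt(x, noise_sigma=0.05, restore_sigma=1.0, seed=0):
    """Sketch of the LVT stage: perturb the fragile trigger, then restore.

    Nearly imperceptible triggers are fragile, so small additive noise is
    enough to break the trigger pattern; a learned DDPM would then repair
    the collateral damage to clean features. `gaussian_filter` is a crude
    stand-in for that DDPM reverse process, used only for illustration.
    """
    rng = np.random.default_rng(seed)
    perturbed = x + noise_sigma * rng.standard_normal(x.shape)  # disrupt trigger
    restored = gaussian_filter(perturbed, sigma=restore_sigma)  # repair features
    return np.clip(restored, 0.0, 1.0)

x = np.random.default_rng(1).random((32, 32))  # hypothetical input in [0, 1]
print(neutralize_lvt(x).shape)                 # (32, 32)
```
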
👥 Authors
Ye Li, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Chengcheng Zhu, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Yanchao Zhao, Nanjing University of Aeronautics and Astronautics (Computer Networks)
Jiale Zhang, Yangzhou University, Yangzhou 225127, China