BDFirewall: Towards Effective and Expeditious Black-Box Backdoor Defense in MLaaS

📅 2025-08-05
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
In black-box Machine Learning-as-a-Service (MLaaS) settings, detecting and defending against diverse backdoor triggers remains challenging because the defender has no access to the model. To address this, we propose BDFirewall, a model-agnostic, progressive defense framework. BDFirewall classifies triggers into high-, semi-, and low-visibility categories based on their impact on the patched area and applies a stage-specific purification strategy to each: saliency-based detection and restoration for high-visibility triggers; trigger reconstruction that treats benign features as "noise" to be denoised away for semi-visibility triggers; and lightweight noise perturbation coupled with DDPM-based restoration for low-visibility triggers. Extensive experiments demonstrate that BDFirewall reduces attack success rates by 33.25% on average, improves accuracy on poisoned samples by 29.64%, and achieves up to 111x faster inference, significantly outperforming existing black-box defenses.
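
To make the staging concrete, below is a minimal, runnable sketch of how such a progressive, model-agnostic purification pipeline could be wired. Every function body here is a hypothetical placeholder (the paper does not publish this API); only the ordering, purifying the most conspicuous triggers first and the subtlest last, follows the description above.

```python
import numpy as np

# Inputs are assumed to be H x W arrays in [0, 1]. All stage bodies are crude
# stand-ins for the paper's learned detectors/denoisers, for illustration only.

def remove_hvt(x, z=3.0):
    # Stage 1 (high-visibility): flag pixels whose deviation from the image
    # median is salient, then "restore" the patched area (here: median fill).
    dev = np.abs(x - np.median(x))
    out = x.copy()
    out[dev > z * dev.std()] = np.median(x)
    return out

def remove_svt(x, denoise_benign=lambda x: np.zeros_like(x)):
    # Stage 2 (semi-visibility): treat benign features as "noise", denoise
    # them away to reconstruct the trigger, then subtract the reconstruction.
    trigger_hat = denoise_benign(x)          # placeholder reconstruction
    return np.clip(x - trigger_hat, 0.0, 1.0)

def remove_lvt(x, sigma=0.05, restore=lambda x: x):
    # Stage 3 (low-visibility): light noise breaks the fragile trigger; a
    # DDPM (stubbed as `restore`) would repair the collateral damage.
    noisy = x + sigma * np.random.default_rng(0).standard_normal(x.shape)
    return np.clip(restore(noisy), 0.0, 1.0)

def bdfirewall_purify(x):
    # Progressive order: most conspicuous triggers first, subtlest last.
    return remove_lvt(remove_svt(remove_hvt(x)))

print(bdfirewall_purify(np.random.default_rng(0).random((32, 32))).shape)
```
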

📝 Abstract
In this paper, we endeavor to address the challenges of backdoor attack countermeasures in black-box scenarios, thereby fortifying the security of inference under MLaaS. We first categorize backdoor triggers from a new perspective, i.e., their impact on the patched area, and divide them into high-visibility triggers (HVT), semi-visibility triggers (SVT), and low-visibility triggers (LVT). Based on this classification, we propose a progressive defense framework, BDFirewall, that removes these triggers from the most conspicuous to the most subtle, without requiring model access. First, for HVTs, which create the most significant local semantic distortions, we identify and eliminate them by detecting these salient differences. We then restore the patched area to mitigate the adverse impact of the removal process. The localized purification designed for HVTs is, however, ineffective against SVTs, which globally perturb benign features. We therefore model an SVT-poisoned input as a mixture of a trigger and benign features, where we unconventionally treat the benign features as "noise". This formulation allows us to reconstruct the SVT by applying a denoising process that removes these benign "noise" features; the SVT-free input is then obtained by subtracting the reconstructed trigger. Finally, to neutralize the nearly imperceptible but fragile LVTs, we introduce lightweight noise to disrupt the trigger pattern and then apply a DDPM to restore any collateral impact on clean features. Comprehensive experiments demonstrate that our method outperforms state-of-the-art defenses: compared with baselines, BDFirewall reduces the Attack Success Rate (ASR) by an average of 33.25%, improves poisoned-sample accuracy (PA) by 29.64%, and achieves up to a 111x speedup in inference time. Code will be made publicly available upon acceptance.
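
The SVT step is the least conventional part of the abstract, so here is a toy numerical illustration of the "benign features as noise" idea. The sinusoidal (SIG-style) trigger and the batch-averaging "denoiser" are both assumptions made purely for illustration; the paper's actual reconstruction is a learned denoising process, not this hand-coded proxy.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w = 32, 32

# Hypothetical low-amplitude global trigger shared across poisoned inputs;
# the benign content differs per image.
trigger = 0.08 * np.sin(np.linspace(0, 6 * np.pi, w))[None, :] * np.ones((h, 1))
benign = 0.8 * rng.random((h, w))
poisoned = benign + trigger

# Crude stand-in for the learned denoiser: benign features vary per image
# ("noise"), so averaging a batch of poisoned inputs cancels them and leaves
# the shared trigger plus a constant benign offset, which we remove.
batch = np.stack([0.8 * rng.random((h, w)) + trigger for _ in range(512)])
trigger_hat = batch.mean(axis=0)
trigger_hat -= trigger_hat.mean()   # drop the benign DC offset

# SVT-free input = poisoned input minus the reconstructed trigger.
purified = poisoned - trigger_hat
print("before:", np.abs(poisoned - benign).mean())  # ~0.05 (trigger present)
print("after: ", np.abs(purified - benign).mean())  # ~0.01 (mostly removed)
```
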
Problem

Research questions and friction points this paper is trying to address.

Defending against backdoor attacks in black-box MLaaS, where the serving model cannot be inspected or modified
Detecting and removing trigger types that range from conspicuous patches to nearly imperceptible perturbations
Purifying poisoned inputs without degrading clean accuracy or inference speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Classifies backdoor triggers by their visibility impact on the patched area (HVT, SVT, LVT)
Removes triggers progressively, from the most conspicuous to the most subtle, without model access
Reconstructs semi-visibility triggers by denoising away benign features, and neutralizes low-visibility ones with lightweight noise plus DDPM restoration (see the sketch below)
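
As a companion to the last point above, here is a small, self-contained sketch of the LVT stage's "perturb, then restore" logic. The Gaussian filter is only a stand-in for the DDPM reverse (denoising) process named in the abstract; the noise level and filter width are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def neutralize_lvt(x, noise_sigma=0.05, restore_sigma=1.0, seed=0):
    """Sketch of the LVT stage: perturb the fragile trigger, then restore.

    Nearly imperceptible triggers are fragile, so small additive noise is
    enough to break the trigger pattern; a learned DDPM would then repair
    the collateral damage to clean features. `gaussian_filter` is a crude
    stand-in for that DDPM reverse process, used only for illustration.
    """
    rng = np.random.default_rng(seed)
    perturbed = x + noise_sigma * rng.standard_normal(x.shape)  # disrupt trigger
    restored = gaussian_filter(perturbed, sigma=restore_sigma)  # repair features
    return np.clip(restored, 0.0, 1.0)

x = np.random.default_rng(1).random((32, 32))  # hypothetical input in [0, 1]
print(neutralize_lvt(x).shape)                 # (32, 32)
```
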
👥 Authors
Ye Li, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Chengcheng Zhu, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Yanchao Zhao, Nanjing University of Aeronautics and Astronautics (Computer Networks)
Jiale Zhang, Yangzhou University, Yangzhou 225127, China