Prediction Inconsistency Helps Achieve Generalizable Detection of Adversarial Examples

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing adversarial example detection methods suffer from poor generalization: they struggle to accommodate both naturally and adversarially trained models and exhibit unstable performance across white-box and black-box attacks. To address these limitations, we propose the PID framework, built on the first empirical observation that an auxiliary model assigns significantly reduced confidence to the primary model's predictions on adversarial examples while maintaining high confidence on clean examples. Leveraging this confidence inconsistency between the primary and auxiliary models, PID enables gradient-free, non-generative detection. The method is lightweight, computationally efficient, and compatible with diverse training paradigms (natural and adversarial training) and threat models (white-box, black-box, and hybrid attacks). Evaluated on CIFAR-10 and ImageNet, PID achieves AUC scores of 99.29% and 96.81%, respectively, surpassing state-of-the-art methods by 4.70 to 25.46 percentage points, and demonstrates strong cross-setting robustness.

📝 Abstract
Adversarial detection protects models from adversarial attacks by refusing suspicious test samples. However, current detection methods often suffer from weak generalization: their effectiveness tends to degrade significantly when applied to adversarially trained models rather than naturally trained ones, and they generally struggle to achieve consistent effectiveness across both white-box and black-box attack settings. In this work, we observe that an auxiliary model, differing from the primary model in training strategy or model architecture, tends to assign low confidence to the primary model's predictions on adversarial examples (AEs), while preserving high confidence on normal examples (NEs). Based on this discovery, we propose Prediction Inconsistency Detector (PID), a lightweight and generalizable detection framework to distinguish AEs from NEs by capturing the prediction inconsistency between the primary and auxiliary models. PID is compatible with both naturally and adversarially trained primary models and outperforms four detection methods across three white-box, three black-box, and one mixed adversarial attack. Specifically, PID achieves average AUC scores of 99.29% and 99.30% on CIFAR-10 when the primary model is naturally and adversarially trained, respectively, and 98.31% and 96.81% on ImageNet under the same conditions, outperforming existing SOTAs by 4.70%–25.46%.
Problem

Research questions and friction points this paper is trying to address.

Weak generalization of adversarial example detectors across models
Inconsistent detection performance between white-box and black-box attacks
Detection that works for naturally trained but degrades on adversarially trained models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses auxiliary model for prediction inconsistency
Lightweight PID framework detects adversarial examples
Compatible with naturally and adversarially trained models
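The core idea above, scoring a sample by how much confidence an auxiliary model places in the primary model's prediction, can be sketched in a few lines. The paper's exact scoring rule and threshold calibration are not reproduced here; this is a minimal illustration assuming both models expose softmax logits, with hypothetical function names (`prediction_inconsistency_score`, `detect_adversarial`) chosen for this sketch.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prediction_inconsistency_score(primary_logits, auxiliary_logits):
    """Auxiliary model's confidence in the primary model's predicted class.

    Per the paper's observation: on normal examples both models tend to
    agree (high score); on adversarial examples the auxiliary model assigns
    low confidence to the primary prediction (low score).
    """
    primary_pred = primary_logits.argmax(axis=-1)
    aux_probs = softmax(auxiliary_logits)
    return aux_probs[np.arange(len(primary_pred)), primary_pred]

def detect_adversarial(primary_logits, auxiliary_logits, threshold=0.5):
    """Flag samples whose inconsistency score falls below the threshold.

    The threshold value here is an illustrative placeholder; in practice it
    would be calibrated on held-out normal examples.
    """
    scores = prediction_inconsistency_score(primary_logits, auxiliary_logits)
    return scores < threshold
```

For example, a sample where both models confidently predict class 0 would score near 1 and pass, while a sample the primary model labels class 0 but the auxiliary model labels class 1 would score near 0 and be flagged. Note the detection is gradient-free: it only needs forward passes through the two models.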
Sicong Han
Xi'an Jiaotong University
deep learning, adversarial examples

Chenhao Lin
School of Cyber Science and Engineering, Xi'an Jiaotong University, China

Zhengyu Zhao
Xi'an Jiaotong University, China
Adversarial Machine Learning, Computer Vision

Xiyuan Wang
School of Automation Science and Technology, Xi'an Jiaotong University, China

Xinlei He
Assistant Professor, HKUST(GZ)
Trustworthy Machine Learning, Security, Privacy

Qian Li
School of Cyber Science and Engineering, Xi'an Jiaotong University, China

Cong Wang
Department of Computer Science, City University of Hong Kong, China

Qian Wang
School of Cyber Science and Engineering, Wuhan University, China

Chao Shen
School of Cyber Science and Engineering, Xi'an Jiaotong University, China