🤖 AI Summary
This work systematically evaluates the effectiveness of state-of-the-art black-box adversarial attacks (e.g., QEBA, Square) against highly robust ImageNet models, namely the top-performing defenses on RobustBench. We find that existing black-box methods achieve success rates below 5% on these strong defenses, dropping over 60 percentage points relative to AutoAttack’s white-box baseline. Crucially, we are the first to empirically demonstrate that high white-box robustness inherently confers black-box transfer robustness. We introduce the concept of “robustness alignment” and rigorously establish its strong negative correlation with transfer attack success (Pearson’s *r* < −0.87). Leveraging this insight, we propose a new black-box attack evaluation benchmark tailored for medium- to high-strength defenses. Our findings provide both theoretical grounding and empirical evidence for understanding the generalization of adversarial robustness and for developing more effective black-box attacks.
📝 Abstract
Although adversarial robustness has been extensively studied in white-box settings, recent advances in black-box attacks (both transfer-based and query-based) are benchmarked primarily against weak defenses, leaving a significant gap in evaluating their effectiveness against more recent and moderately robust models (e.g., those featured on the RobustBench leaderboard). In this paper, we question why black-box attacks have paid so little attention to robust models. We establish a framework to evaluate the effectiveness of recent black-box attacks against both top-performing and standard defense mechanisms on the ImageNet dataset. Our empirical evaluation reveals the following key findings: (1) the most advanced black-box attacks struggle to succeed even against simple adversarially trained models; (2) robust models that are optimized to withstand strong white-box attacks, such as AutoAttack, also exhibit enhanced resilience against black-box attacks; and (3) robustness alignment between the surrogate models and the target model is a key factor in the success rate of transfer-based attacks.
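As a rough illustration of the two quantities these findings rest on, the sketch below computes a transfer attack success rate and a Pearson correlation coefficient in plain Python. It is only a sketch under stated assumptions: the text above does not give the paper's exact definitions, so the success rate here counts only inputs the target model classifies correctly before the attack (a common convention, assumed rather than confirmed), and the function names `transfer_success_rate` and `pearson_r` are hypothetical. How "robustness alignment" itself is measured is not specified here, so it is left as an input the caller would supply.

```python
from math import sqrt


def transfer_success_rate(labels, target_clean_preds, target_adv_preds):
    """Share of correctly classified inputs that the target misclassifies
    once they are replaced by adversarial examples crafted on a surrogate.

    Assumption: inputs the target already gets wrong are excluded.
    """
    eligible = [
        (y, adv)
        for y, clean, adv in zip(labels, target_clean_preds, target_adv_preds)
        if clean == y
    ]
    if not eligible:
        return 0.0
    fooled = sum(1 for y, adv in eligible if adv != y)
    return fooled / len(eligible)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

In use, one would compute `transfer_success_rate` once per surrogate and target pair, then pass the per-pair alignment scores and success rates to `pearson_r` to check the sign and strength of their relationship.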