🤖 AI Summary
This work examines how a lightweight, differentiable front end can make a model appear robust to white-box gradient-based attacks (e.g., APGD, FAB-T) without making it robust. The front end is a fully convolutional module with a skip connection, trained for approximately one epoch with a small learning rate while the backbone classifier stays frozen. The resulting model shows strong gradient masking even though it is fully differentiable and contains no gradient-shattering components. Combined with a randomized ensemble and adversarial training of the backbone, it attains 90.8±2.5% accuracy under AutoAttack on CIFAR-10, yet accuracy falls to 18.2±3.6% under an adaptive attack, exposing the limits of standard white-box evaluation protocols. The authors estimate that such ensembles also reach near-state-of-the-art AutoAttack accuracy on CIFAR-100 and ImageNet, which underscores the need for adaptive attacks in rigorous robustness assessment.
📝 Abstract
We tested front-end enhanced neural models in which a frozen classifier was preceded by a differentiable, fully convolutional model with a skip connection. By training them with a small learning rate for about one epoch, we obtained models that retained the accuracy of the backbone classifier while being unusually resistant to gradient attacks, including the APGD and FAB-T attacks from the AutoAttack package, which we attributed to gradient masking. The gradient masking phenomenon is not new, but the degree of masking was quite remarkable for fully differentiable models that did not have gradient-shattering components such as JPEG compression or components that are expected to cause diminishing gradients. Though black-box attacks can be partially effective against gradient masking, they are easily defeated by combining models into randomized ensembles. We estimate that such ensembles achieve near-SOTA AutoAttack accuracy on CIFAR10, CIFAR100, and ImageNet despite having virtually zero accuracy under adaptive attacks. Adversarial training of the backbone classifier can further increase the resistance of the front-end enhanced model to gradient attacks. On CIFAR10, the respective randomized ensemble achieved 90.8$\pm 2.5$% (99% CI) accuracy under AutoAttack while having only 18.2$\pm 3.6$% accuracy under the adaptive attack. We do not establish SOTA in adversarial robustness. Instead, we make methodological contributions and further support the thesis that adaptive attacks designed with complete knowledge of the model architecture are crucial in demonstrating model robustness and that even the so-called white-box gradient attacks can have limited applicability. Although gradient attacks can be complemented with black-box attacks such as the SQUARE attack or the zero-order PGD, black-box attacks can be weak against randomized ensembles, e.g., when ensemble models mask gradients.
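To make the randomized-ensemble idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes an ensemble that answers each query with one member chosen uniformly at random; the names `RandomizedEnsemble` and `make_toy_member` and the two-class toy scorers are hypothetical stand-ins for front-end enhanced classifiers. It only illustrates one plausible reason query-based attacks struggle: repeated queries on the same input can return different answers, so the feedback an attacker compares across queries becomes noisy.

```python
import random
from typing import Callable, List, Sequence

# A "model" here is any callable mapping an input to a list of class scores.
Model = Callable[[Sequence[float]], List[float]]


class RandomizedEnsemble:
    """Toy randomized ensemble: one randomly chosen member answers each query.

    Hypothetical illustration; the members are stand-ins, not the paper's models.
    """

    def __init__(self, members: List[Model], seed: int = 0) -> None:
        if not members:
            raise ValueError("ensemble needs at least one member")
        self.members = members
        self.rng = random.Random(seed)  # seeded so the demo is reproducible

    def scores(self, x: Sequence[float]) -> List[float]:
        # A fresh member is drawn for every query.
        member = self.rng.choice(self.members)
        return member(x)

    def predict(self, x: Sequence[float]) -> int:
        s = self.scores(x)
        return max(range(len(s)), key=s.__getitem__)


def make_toy_member(offset: float) -> Model:
    """Two-class scorer whose decision boundary is shifted by `offset`."""
    def member(x: Sequence[float]) -> List[float]:
        total = sum(x) + offset
        return [-total, total]
    return member


ensemble = RandomizedEnsemble(
    [make_toy_member(o) for o in (-0.3, 0.0, 0.3)], seed=0
)

x_clear = [1.0, 1.0]       # far from every member's boundary
x_borderline = [0.1, 0.1]  # close to the boundary, so members disagree

# Collect the distinct labels seen over repeated queries of the same input.
clear_preds = {ensemble.predict(x_clear) for _ in range(200)}
borderline_preds = {ensemble.predict(x_borderline) for _ in range(200)}

print("labels seen for clear input:", sorted(clear_preds))
print("labels seen for borderline input:", sorted(borderline_preds))
```

In this toy setting the clear input always receives the same label, while the borderline input flips between labels from query to query. An attack that decides whether a perturbation helped by comparing successive outputs therefore cannot tell a real improvement from a change of ensemble member.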