AI Summary
This work proposes a reinforcement learning approach to enhance the robustness of image classifiers against gradient-based adversarial attacks, which exploit model gradients to efficiently craft perturbations that severely compromise deep neural network security. By training classifiers using policy gradients combined with ε-greedy exploration, the method implicitly disrupts the gradient structure relied upon by attackers, rendering gradient directions unstable and magnitudes diminished, thereby hindering adversarial optimization. This study is the first to demonstrate that reinforcement learning can serve as an implicit regularizer at the gradient level and integrates it with adversarial training to form a dual-layer defense mechanism. Experiments show that the proposed RL-trained models significantly reduce the success rates of attacks such as PGD, and the combined RL-adv framework achieves state-of-the-art robustness across CIFAR-10, CIFAR-100, and ImageNet-100 against diverse attack strategies, substantially outperforming conventional combinations of supervised learning and adversarial training.
Abstract
Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcement learning (RL) training can disrupt the gradient structure used by attackers by training image classifiers with policy-gradient objectives and epsilon-greedy exploration. Through systematic experiments across CIFAR-10, CIFAR-100, and ImageNet-100 with multiple architectures, we find that RL-trained classifiers significantly disrupt gradient-based adversarial optimization. To explain this, we conduct a comprehensive mechanism analysis using loss landscape visualization, static and dynamic gradient indicators, and predictive entropy. Our analysis reveals that RL acts as an implicit regularizer, producing models with highly unstable gradient directions and smaller gradient magnitudes. This combination makes each PGD step both unreliable in direction and limited in magnitude, causing gradient-based attacks to fail within practical iteration budgets. We further show that combining RL with adversarial training (RL-adv) provides a dual-layer defense operating at two complementary levels: RL degrades gradient information available to attackers (gradient-level defense), while adversarial training strengthens decision boundaries (boundary-level defense). RL-adv achieves the highest robustness across all major attack types evaluated, including gradient-based (PGD, AutoAttack), transfer-based, and query-based attacks, outperforming SL-adv by a significant margin. These findings identify RL-induced gradient disruption as a complementary robustness mechanism and motivate future research on hybrid SL-RL training schedules that combine SL's efficiency with RL's gradient-regularization properties.
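To make the training signal concrete, the following is a minimal sketch of one way a classifier could be treated as a policy and updated with a policy-gradient objective under epsilon-greedy exploration. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the reward, the exact objective, or how exploration enters the update. The reward of +1 for a correct prediction and -1 otherwise, and the function names `epsilon_greedy_action` and `reinforce_logit_grad`, are hypothetical choices made for this example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def epsilon_greedy_action(probs, epsilon, rng):
    """With probability epsilon pick a uniformly random class, else the argmax class."""
    if rng.random() < epsilon:
        return rng.randrange(len(probs))
    return max(range(len(probs)), key=probs.__getitem__)


def reinforce_logit_grad(logits, label, epsilon, rng):
    """Return (action, reward, grad) for a single example.

    grad is the gradient of the loss -reward * log pi(action) with respect
    to the logits, where pi is the softmax of the logits.
    """
    probs = softmax(logits)
    action = epsilon_greedy_action(probs, epsilon, rng)
    # Assumed reward scheme (not taken from the paper): +1 if correct, -1 otherwise.
    reward = 1.0 if action == label else -1.0
    # d/dz_k of -r * log softmax(z)[a] equals -r * (1[k == a] - p_k).
    grad = [-reward * ((1.0 if k == action else 0.0) - p) for k, p in enumerate(probs)]
    return action, reward, grad


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [2.0, 0.5, -1.0]
    action, reward, grad = reinforce_logit_grad(logits, label=0, epsilon=0.1, rng=rng)
    print(action, reward, grad)
```

Unlike the cross-entropy gradient, which always points toward the true label, this gradient depends on which class was sampled and on the sign of the reward, so its direction varies from step to step for the same input. Whether this is the source of the unstable gradient directions reported in the abstract is a question the paper's mechanism analysis addresses; the sketch shows only the form of the update.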