🤖 AI Summary
Deep neural networks are vulnerable to adversarial perturbations; existing robustness methods often rely on architectural modifications or test-time input purification, limiting their generalizability and practicality. To address this, we propose a model-agnostic multi-objective representation learning framework that requires no test-time intervention. Our approach is the first to jointly integrate a multi-positive contrastive loss with a cosine similarity loss in adversarial robustness training, simultaneously optimizing feature alignment and classification accuracy. This encourages natural samples and their adversarial counterparts to form compact intra-class clusters in the embedding space. Extensive experiments demonstrate that our method significantly improves robustness against both white-box and black-box attacks, outperforming existing architecture-agnostic alternatives. The implementation is publicly available.
📝 Abstract
Extensive research has shown that deep neural networks (DNNs) are vulnerable to slight adversarial perturbations: small changes to the input data that appear insignificant but cause the model to produce drastically different outputs. Beyond augmenting training data with adversarial examples generated by a specific attack method, most current defense strategies require modifying components of the original model architecture to improve robustness, or performing test-time data purification to handle adversarial attacks. In this work, we demonstrate that strong feature representation learning during training can significantly enhance the original model's robustness. We propose MOREL, a multi-objective feature representation learning approach that encourages classification models to produce similar features for inputs within the same class, despite perturbations. Our training method uses an embedding space where a cosine similarity loss and a multi-positive contrastive loss align natural and adversarial features from the model encoder and ensure tight clustering. Concurrently, the classifier is encouraged to make accurate predictions. Through extensive experiments, we demonstrate that our approach significantly enhances the robustness of DNNs against white-box and black-box adversarial attacks, outperforming other methods that similarly require no architectural changes or test-time data purification. Our code is available at https://github.com/salomonhotegni/MOREL.
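To make the two embedding-space objectives concrete, here is a minimal NumPy sketch of the general loss shapes the abstract describes: a cosine similarity loss that aligns each natural feature with its adversarial counterpart, and a SupCon-style multi-positive contrastive loss in which every same-class embedding serves as a positive. Function names, the temperature value, and all implementation details are illustrative assumptions, not the authors' exact MOREL implementation.

```python
import numpy as np

def cosine_alignment_loss(z_nat, z_adv, eps=1e-8):
    """1 - mean cosine similarity between paired natural/adversarial features.

    z_nat, z_adv: (N, D) arrays of encoder features for the same N inputs.
    Hypothetical sketch of the alignment objective described in the abstract.
    """
    num = np.sum(z_nat * z_adv, axis=1)
    den = np.linalg.norm(z_nat, axis=1) * np.linalg.norm(z_adv, axis=1) + eps
    return float(np.mean(1.0 - num / den))

def multi_positive_contrastive_loss(z, labels, tau=0.1):
    """Supervised (multi-positive) contrastive loss, SupCon-style.

    For each anchor i, all other samples with the same label are positives;
    the loss pulls same-class embeddings together and pushes classes apart.
    """
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)  # L2-normalize
    sim = z @ z.T / tau                      # pairwise scaled cosine similarities
    n = len(labels)
    total = 0.0
    for i in range(n):
        pos_mask = labels == labels[i]
        pos_mask[i] = False                  # the anchor is not its own positive
        if not pos_mask.any():
            continue
        logits = np.delete(sim[i], i)        # denominator excludes the anchor itself
        log_denom = np.log(np.sum(np.exp(logits)))
        total += -np.mean(sim[i][pos_mask] - log_denom)
    return total / n
```

In a training loop, these two terms would be combined (e.g. as a weighted sum) with the standard classification loss, so that feature alignment and tight clustering are optimized jointly with accuracy; here the contrastive loss would be applied to the concatenated natural and adversarial embeddings so adversarial samples cluster with their class.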