Single Word Change is All You Need: Designing Attacks and Defenses for Text Classifiers

📅 2024-01-30
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the vulnerability of text classifiers to single-word perturbations. First, it formalizes word-level robustness via a quantitative metric ρ. Second, it introduces SP-Attack, an efficient adversarial method that achieves high misclassification rates by substituting only one semantically plausible word. Third, it proposes SP-Defense, the first defense specifically designed for single-perturbation robustness, which combines semantic-preserving data augmentation with fine-tuning of BERT and DistilBERT. Experiments demonstrate that SP-Defense improves ρ by 14.6% and 13.9% on BERT and DistilBERT, respectively, while reducing SP-Attack success rates by 30.4% and 21.2%. Moreover, it significantly degrades the efficacy of multi-word attacks. These results establish that single-perturbation vulnerability is both measurable and mitigable, offering a principled framework for evaluating and enhancing lexical robustness in neural text classifiers.
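The metric ρ is only characterized at a high level here. As one concrete reading, the sketch below scores a classifier by the fraction of inputs whose prediction survives every single-word substitution; `classifier` and `candidate_words` are hypothetical stand-ins, not the paper's actual definition or implementation.

```python
# Hypothetical sketch (not the paper's exact definition of rho): estimate a
# robustness score as the fraction of inputs whose prediction cannot be
# flipped by any single-word substitution. `classifier` maps a sentence to a
# label; `candidate_words` proposes substitutes for a word (e.g., synonyms).

def is_single_word_robust(classifier, sentence, candidate_words):
    """True if no single-word substitution changes the prediction."""
    original_label = classifier(sentence)
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for substitute in candidate_words(word):
            perturbed = " ".join(tokens[:i] + [substitute] + tokens[i + 1:])
            if classifier(perturbed) != original_label:
                return False  # a single word change flips the label
    return True

def estimate_rho(classifier, sentences, candidate_words):
    """Estimated robustness: share of sentences that no single edit can flip."""
    robust = sum(
        is_single_word_robust(classifier, s, candidate_words) for s in sentences
    )
    return robust / len(sentences)
```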

📝 Abstract
In text classification, creating an adversarial example means subtly perturbing a few words in a sentence without changing its meaning so that a classifier misclassifies it. A concerning observation is that a significant portion of adversarial examples generated by existing methods change only one word. This single-word perturbation vulnerability represents a significant weakness in classifiers, which malicious users can exploit to efficiently create a multitude of adversarial examples. This paper studies this problem and makes the following key contributions: (1) We introduce a novel metric ρ to quantitatively assess a classifier's robustness against single-word perturbation. (2) We present SP-Attack, designed to exploit the single-word perturbation vulnerability; it achieves a higher attack success rate and better preserves sentence meaning while reducing computation costs compared to state-of-the-art adversarial methods. (3) We propose SP-Defense, which aims to improve ρ by applying data augmentation during learning. Experimental results on 4 datasets with BERT and DistilBERT classifiers show that SP-Defense improves ρ by 14.6% and 13.9% and decreases the attack success rate of SP-Attack by 30.4% and 21.2% on the two classifiers respectively, and also decreases the attack success rate of existing attack methods that involve multiple-word perturbations.
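To make the single-word attack idea concrete, here is a minimal greedy sketch in the spirit of SP-Attack. It uses the real Hugging Face `pipeline` API for masked-word candidates and classification, but the search loop, candidate filtering, and model choices are illustrative assumptions, not the paper's method.

```python
# Hedged sketch of a single-word substitution attack: for each position,
# propose in-context replacements with a masked language model and keep the
# first one that flips the classifier's label. Illustrative only.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")
classify = pipeline("sentiment-analysis")  # stand-in victim classifier

def single_word_attack(sentence, top_k=10):
    """Try to flip the prediction by replacing exactly one word."""
    original = classify(sentence)[0]["label"]
    tokens = sentence.split()
    for i in range(len(tokens)):
        masked = " ".join(
            tokens[:i] + [fill_mask.tokenizer.mask_token] + tokens[i + 1:]
        )
        for candidate in fill_mask(masked, top_k=top_k):
            word = candidate["token_str"].strip()
            if word.lower() == tokens[i].lower():
                continue  # skip the original word
            perturbed = " ".join(tokens[:i] + [word] + tokens[i + 1:])
            if classify(perturbed)[0]["label"] != original:
                return perturbed  # adversarial example found
    return None  # no single-word flip found by this search
```

This brute-force variant only illustrates why one substitution can suffice; the paper's contribution is doing this with a higher success rate and lower computation cost than multi-word searches.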
Problem

Research questions and friction points this paper is trying to address.

Assessing classifier robustness against single-word perturbations
Exploiting single-word vulnerability for efficient adversarial attacks
Improving defense against single-word perturbation attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the metric ρ to quantify single-word robustness
Develops SP-Attack for efficient single-word adversarial examples
Proposes SP-Defense via data augmentation (a hedged sketch follows this list)
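A minimal sketch of the augmentation idea behind SP-Defense, assuming label-preserving single-word substitutions are added to the training set before standard fine-tuning; `candidate_words` and the variant count are hypothetical, and the paper's actual augmentation and training schedule may differ.

```python
# Hypothetical SP-Defense-style augmentation: generate variants of each
# training sentence that differ in exactly one word but keep the gold label,
# so fine-tuning sees single-word perturbations as label-invariant.
import random

def augment_single_word(sentence, label, candidate_words, n_variants=4):
    """Create label-preserving variants differing in exactly one word."""
    tokens = sentence.split()
    if not tokens:
        return []
    variants = []
    for _ in range(n_variants):
        i = random.randrange(len(tokens))
        substitutes = candidate_words(tokens[i])  # hypothetical synonym source
        if not substitutes:
            continue
        perturbed = tokens[:i] + [random.choice(substitutes)] + tokens[i + 1:]
        variants.append((" ".join(perturbed), label))  # keep the gold label
    return variants

# Fine-tuning then runs on the union of original and augmented pairs,
# e.g. with a standard Hugging Face Trainer over BERT or DistilBERT.
```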
👤 Authors
Lei Xu (MIT LIDS)
Sarah Alnegheimish (MIT)
Laure Berti-Équille (IRD)
Alfredo Cuesta-Infante (Universidad Rey Juan Carlos)
K. Veeramachaneni (MIT LIDS)