Universal Adversarial Triggers

📅 2026-05-18
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work proposes a method for generating stealthy universal adversarial triggers. Existing approaches often produce syntactically awkward or semantically unnatural phrases that are easy to detect. The proposed approach enforces grammatical plausibility through part-of-speech filtering and improves fluency by adding a language-model perplexity-based loss term during optimization. Adversarial training with the generated triggers is then used to improve model robustness. On the SST sentiment analysis task, the generated triggers reduce prediction accuracy to as low as 0.04 when flipping positive predictions to negative and 0.12 when flipping negative predictions to positive. After adversarial training, model accuracy recovers from 0.12 to 0.48, showing that the method can both make attacks harder to detect and support a defense against them.
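To make the perplexity-based loss term concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy add-one-smoothed bigram language model in place of a neural one, and the names `train_bigram_counts`, `perplexity`, `combined_loss`, and the `weight` parameter are hypothetical. The page does not give the exact form of the loss; here a log-perplexity penalty is simply added to whatever attack loss is being minimized.

```python
import math
from collections import Counter

def train_bigram_counts(corpus):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def perplexity(tokens, unigrams, bigrams):
    """Perplexity of a token sequence under an add-one-smoothed bigram model."""
    vocab_size = len(unigrams)
    history = ["<s>"] + list(tokens)
    log_prob = 0.0
    for prev, cur in zip(history, history[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))

def combined_loss(attack_loss, tokens, unigrams, bigrams, weight=0.1):
    """Attack loss plus a weighted log-perplexity penalty (assumed form)."""
    return attack_loss + weight * math.log(perplexity(tokens, unigrams, bigrams))

# Toy corpus standing in for language-model training data.
corpus = ["the movie was bad", "the movie was good", "the plot was bad"]
unigrams, bigrams = train_bigram_counts(corpus)
fluent = "the movie was bad".split()
garbled = "bad the was movie".split()
```

With equal attack loss, `combined_loss` prefers the `fluent` candidate over the `garbled` one, which is the effect the fluency term is meant to have during trigger search.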
📝 Abstract
Recent works have illustrated that modern NLP models trained for diverse tasks, ranging from sentiment analysis to language generation, succumb to universal adversarial attacks, a class of input-agnostic attacks where a common trigger sequence is used to attack the model. Although these attacks are successful, the triggers they generate are ungrammatical and unnatural. Our work proposes a novel technique combining part-of-speech filtering and a perplexity-based loss function to generate sensible triggers that are closer to natural phrases. For the task of sentiment analysis on the SST dataset, the method produces sensible triggers that achieve accuracies as low as 0.04 and 0.12 for flipping positive to negative predictions and vice versa. To build robust models, we also perform adversarial training using the generated triggers, which increases the accuracy of the model from 0.12 to 0.48. We aim to illustrate that adversarial attacks can be made difficult to detect by generating sensible triggers, and to facilitate robust model development through relevant defenses.
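The part-of-speech filtering step can be illustrated with a short sketch. It rests on several assumptions that are not in the abstract: each trigger slot is constrained to a single tag from a fixed template, candidates arrive already ranked by the attack objective, and a hand-written toy lexicon (`TOY_POS_LEXICON`) stands in for a real POS tagger. The function names are hypothetical.

```python
# Hypothetical toy lexicon; a real pipeline would use a trained POS tagger.
TOY_POS_LEXICON = {
    "the": "DET", "a": "DET",
    "awful": "ADJ", "dull": "ADJ", "wonderful": "ADJ",
    "movie": "NOUN", "plot": "NOUN",
    "fails": "VERB", "shines": "VERB",
}

def filter_by_pos(candidates, allowed_tags, lexicon=TOY_POS_LEXICON):
    """Keep only candidates whose tag is in allowed_tags, preserving rank order."""
    return [tok for tok in candidates if lexicon.get(tok) in allowed_tags]

def pick_trigger(ranked_candidates_per_slot, pos_template, lexicon=TOY_POS_LEXICON):
    """Pick, for each slot, the top-ranked candidate matching that slot's POS tag."""
    trigger = []
    for candidates, tag in zip(ranked_candidates_per_slot, pos_template):
        kept = filter_by_pos(candidates, {tag}, lexicon)
        if not kept:
            raise ValueError(f"no candidate with tag {tag!r}")
        trigger.append(kept[0])
    return trigger

# Candidates per slot, best-first according to some attack objective (made up here).
ranked = [
    ["fails", "the", "a"],
    ["movie", "dull", "awful"],
    ["awful", "plot", "movie"],
]
trigger = pick_trigger(ranked, ["DET", "ADJ", "NOUN"])
```

Without the filter, the top-ranked tokens would give "fails movie awful"; with the determiner-adjective-noun template, the selection becomes "the dull plot", trading some attack strength for grammatical plausibility.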
Problem

Research questions and friction points this paper is trying to address.

universal adversarial triggers
natural language processing
adversarial attacks
sentiment analysis
trigger generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

universal adversarial triggers
perplexity-based loss
POS filtering
natural language attacks
adversarial training
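
As a rough illustration of adversarial training with generated triggers, the sketch below augments a labeled dataset with trigger-prefixed copies whose labels are left unchanged. The page does not describe how triggers are mixed into training, so the prefix placement, the `fraction` parameter, and the function name `augment_with_triggers` are all assumptions.

```python
import random

def augment_with_triggers(examples, triggers, fraction=0.5, seed=0):
    """Return the original examples plus trigger-prefixed copies with unchanged labels."""
    rng = random.Random(seed)
    n_aug = int(len(examples) * fraction)
    # Sample a subset of examples to copy; the originals are always kept.
    chosen = rng.sample(examples, n_aug)
    augmented = [(f"{rng.choice(triggers)} {text}", label) for text, label in chosen]
    return examples + augmented

# Made-up examples and triggers, for illustration only.
examples = [
    ("a moving and funny film", "positive"),
    ("flat and lifeless", "negative"),
    ("brilliant performances", "positive"),
    ("a tedious mess", "negative"),
]
triggers = ["the dull plot", "a wonderful movie"]
train_set = augment_with_triggers(examples, triggers, fraction=0.5)
```

Keeping the original label on each trigger-prefixed copy is what teaches the model to ignore the trigger rather than follow it.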