Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation

📅 2025-12-08

🤖 AI Summary

This work addresses the vulnerability of large language models (LLMs) to adversarial prompt attacks in zero- and few-shot classification. We propose Universal Adversarial Suffixes (UAS): short token sequences (4–10 tokens) that, when appended to any input, reduce both accuracy and calibrated confidence across tasks and models. The suffix is learned in a differentiable soft form via Gumbel-Softmax relaxation over the token vocabulary and then discretized for inference. Training maximizes calibrated cross-entropy on the label region, masks gold tokens to prevent label leakage, and adds entropy regularization to avoid optimization collapse. Experiments with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning show that a single suffix trained on one model transfers across tasks and model families.
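
The soft-then-discrete idea in the summary can be illustrated with a minimal sketch. It is not the paper's implementation: the vocabulary size, embedding table, temperature `tau`, and the helper names `gumbel_softmax`, `soft_embedding`, and `discretize` are all hypothetical, and a real attack would backpropagate through a model rather than only sampling.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed one-hot sample from `logits` at temperature `tau`."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # clamp U away from 0 and 1 so both logs stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    z = sum(exps)
    return [e / z for e in exps]

def soft_embedding(weights, embedding_table):
    """Convex combination of embedding rows: the 'soft' token fed to the model."""
    dim = len(embedding_table[0])
    return [sum(w * row[d] for w, row in zip(weights, embedding_table))
            for d in range(dim)]

def discretize(suffix_logits):
    """Pick the highest-logit token id at each suffix position for inference."""
    return [max(range(len(row)), key=row.__getitem__) for row in suffix_logits]

# Toy setup (all sizes are illustrative assumptions).
rng = random.Random(0)
vocab_size, dim, suffix_len = 6, 4, 4
embedding_table = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
suffix_logits = [[rng.gauss(0, 1) for _ in range(vocab_size)] for _ in range(suffix_len)]

# Training-time view: one soft embedding per suffix position.
soft_suffix = [soft_embedding(gumbel_softmax(row, 0.5, rng), embedding_table)
               for row in suffix_logits]
# Inference-time view: concrete token ids.
hard_suffix = discretize(suffix_logits)
```

Lower values of `tau` push each sample closer to a one-hot vector, which narrows the gap between the soft suffix used in training and the discrete suffix used at inference.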

📝 Abstract

Language models (LMs) are often used as zero-shot or few-shot classifiers by scoring label words, but they remain fragile to adversarial prompts. Prior work typically optimizes task- or model-specific triggers, making results difficult to compare and limiting transferability. We study universal adversarial suffixes: short token sequences (4-10 tokens) that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable "soft" form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent trivial leakage, with entropy regularization to avoid collapse. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence. Experiments on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B demonstrate consistent attack effectiveness and transfer across tasks and model families.
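
As a rough illustration of scoring label words and measuring the loss the attack tries to increase, consider the sketch below. The abstract does not spell out what "calibrated" means; this example assumes it refers to subtracting the label-word scores obtained from a content-free prompt before normalizing, which is an assumption and not a confirmed detail of the paper. The function names and the numeric logits are hypothetical.

```python
import math

def label_log_probs(label_logits, prior_logits=None):
    """Log-softmax restricted to the label words.

    If `prior_logits` (label-word logits from a content-free prompt) are
    given, subtract them first. This is the assumed form of calibration.
    """
    scores = list(label_logits)
    if prior_logits is not None:
        scores = [s - p for s, p in zip(scores, prior_logits)]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def cross_entropy(label_logits, gold_index, prior_logits=None):
    """Negative log-probability of the gold label; the attacker maximizes this."""
    return -label_log_probs(label_logits, prior_logits)[gold_index]

# Hypothetical logits for the label words ("positive", "negative"); gold = index 0.
clean_loss = cross_entropy([2.0, 0.0], gold_index=0)
attacked_loss = cross_entropy([0.5, 1.5], gold_index=0)
```

In this toy case the suffix has shifted probability mass away from the gold label, so `attacked_loss` exceeds `clean_loss`; a training loop would adjust the suffix parameters to push this quantity upward across many inputs.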

Problem

Research questions and friction points this paper is trying to address.

Develops universal adversarial suffixes to attack language models across tasks.

Optimizes suffixes using Gumbel-Softmax relaxation for transferable adversarial prompts.

Reduces accuracy and confidence in zero-shot and few-shot classification settings.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal adversarial suffixes using Gumbel-Softmax relaxation

Differentiable soft suffix training with entropy regularization

Single suffix transfers across tasks and models
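
One plausible reading of the entropy regularization mentioned above is an entropy bonus on the per-position token distributions of the suffix, so that they do not collapse onto a single token too early. The sketch below assumes that form; the sign, the weight `lam`, and the function names are illustrative assumptions and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def regularized_objective(attack_loss, suffix_logits, lam):
    """Attack loss plus `lam` times the mean entropy over suffix positions.

    The objective is maximized, so a positive `lam` rewards keeping the
    suffix token distributions spread out (assumed form of the regularizer).
    """
    mean_entropy = sum(entropy(softmax(row)) for row in suffix_logits) / len(suffix_logits)
    return attack_loss + lam * mean_entropy

# A uniform position has maximal entropy; a sharply peaked one has nearly none.
uniform_entropy = entropy(softmax([0.0, 0.0, 0.0, 0.0]))
peaked_entropy = entropy(softmax([10.0, 0.0, 0.0, 0.0]))
objective = regularized_objective(1.0, [[0.0, 0.0, 0.0, 0.0]], lam=0.1)
```

In practice the weight would typically be tuned or annealed so that the distributions can still sharpen enough for the final discretization step.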
