The Resurgence of GCG Adversarial Attacks on Large Language Models

📅 2025-08-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study systematically evaluates the jailbreaking efficacy of gradient-based adversarial prompting—specifically Greedy Coordinate Gradient (GCG) and its simulated-annealing variant T-GCG—across multi-scale open-source LLMs (Qwen2.5-0.5B, LLaMA-3.2-1B, GPT-OSS-20B), focusing on safety-aligned and reasoning-intensive tasks, particularly code-generation prompts. Results show a sharp decline in attack success rate with increasing model scale, and prefix-level evaluation substantially overestimates true semantic jailbreaking effectiveness. Key contributions: (1) identification of the inference process itself as a novel attack surface; (2) empirical confirmation that code-oriented prompts are significantly more vulnerable than general safety prompts; and (3) validation that while T-GCG improves search robustness, it yields limited semantic gains over standard GCG. Collectively, findings expose fundamental scalability limitations of GCG-style methods and provide empirical grounding for refining evaluation paradigms and designing robust defenses.

Technology Category

Application Category

📝 Abstract

Gradient-based adversarial prompting, such as the Greedy Coordinate Gradient (GCG) algorithm, has emerged as a powerful method for jailbreaking large language models (LLMs). In this paper, we present a systematic appraisal of GCG and its annealing-augmented variant, T-GCG, across open-source LLMs of varying scales. Using Qwen2.5-0.5B, LLaMA-3.2-1B, and GPT-OSS-20B, we evaluate attack effectiveness on both safety-oriented prompts (AdvBench) and reasoning-intensive coding prompts. Our study reveals three key findings: (1) attack success rates (ASR) decrease with model size, reflecting the increasing complexity and non-convexity of larger models' loss landscapes; (2) prefix-based heuristics substantially overestimate attack effectiveness compared to GPT-4o semantic judgments, which provide a stricter and more realistic evaluation; and (3) coding-related prompts are significantly more vulnerable than adversarial safety prompts, suggesting that reasoning itself can be exploited as an attack vector. In addition, preliminary results with T-GCG show that simulated annealing can diversify adversarial search and achieve competitive ASR under prefix evaluation, though its benefits under semantic judgment remain limited. Together, these findings highlight the scalability limits of GCG, expose overlooked vulnerabilities in reasoning tasks, and motivate further development of annealing-inspired strategies for more robust adversarial evaluation.

Problem

Research questions and friction points this paper is trying to address.

Evaluating GCG adversarial attack effectiveness across different LLM sizes

Assessing vulnerability differences between safety prompts and coding tasks

Investigating simulated annealing augmentation for adversarial search diversification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-based adversarial prompting with GCG algorithm

Systematic appraisal of GCG and annealing-augmented T-GCG

Evaluating attack effectiveness across varying model sizes

🔎 Similar Papers

No similar papers found.

Authors to Follow