Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a limitation of existing adversarial robustness evaluations: they ignore how widely computational cost differs among attack strategies, which undermines realistic risk assessment. To close this gap, the work proposes a compute-aware evaluation framework grounded in cumulative floating-point operations (FLOPs). It introduces risk-compute curves, which map compute budgets to attack risk, and two aggregate metrics that summarize the average pressure an attack needs to succeed. Experiments across ten models from three families, four training and alignment stages, multiple harm categories, and three attack types (gradient-based, iterative refinement, and template-based) yield five findings: alignment training has non-monotonic effects on compute-space robustness; scaling model size reduces the effectiveness of gradient-based attacks but does little against cheaper template-based attacks; gradient-based attacks optimized on a surrogate model can transfer to a separate target model, which lowers attacker cost; compute cost varies by up to about 5× across harm categories within a single model; and safety-aligned RL raises aggregate attack cost while leaving some categories disproportionately accessible.

📝 Abstract

Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. In practice, the computational expense of different attack strategies can vary by orders of magnitude. Consequently, ASR at a fixed budget can obscure the true effort required to jailbreak a model, thereby making it hard to determine whether an attack's cost justifies its payoff to the attacker. We propose a compute-aware evaluation framework based on computational pressure, measured in cumulative floating-point operations (FLOPs), as a proxy for adversarial effort. We introduce risk-compute curves, which map compute budgets to attack risk, and derive two metrics that summarize the average pressure required for a given attack to succeed. Across ten models spanning three families and four different stages in language model training and alignment, evaluated with three attack strategies (gradient-based, iterative refinement, and template-based) on two jailbreak robustness benchmarks, we find: (1) alignment training has non-monotonic effects on compute-space robustness; (2) scaling model size reduces gradient-based attack effectiveness but has limited impact on cheaper template-based attacks; (3) gradient-based attacks optimized on a surrogate model can transfer to a separate target model, providing a way to reduce attacker costs; (4) compute cost varies by up to ${\approx}5{\times}$ across harm categories within a single model; and (5) safety-aligned RL increases aggregate cost while leaving some categories disproportionately accessible. We release our framework to enable compute-aware risk assessment and evaluation.
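The idea of a risk-compute curve can be illustrated with a minimal sketch. It assumes that each prompt has a recorded cumulative FLOPs count at which the attack first succeeded (or infinity if it never did), and it takes "risk" at a budget to be the fraction of prompts broken within that budget. The function name, the data, and this definition of risk are illustrative assumptions, not the paper's released framework or its exact metric definitions.

```python
import math


def risk_compute_curve(flops_to_success, budgets):
    """Return, for each budget, the fraction of prompts whose attack
    succeeded using at most that many cumulative FLOPs.

    flops_to_success: per-prompt FLOPs at first success (math.inf = never).
    budgets: compute budgets (in FLOPs) at which to evaluate risk.
    """
    n = len(flops_to_success)
    return [
        sum(1 for cost in flops_to_success if cost <= budget) / n
        for budget in budgets
    ]


# Hypothetical per-prompt costs; the last prompt was never jailbroken.
costs = [1e12, 5e12, 2e13, math.inf]
budgets = [1e12, 1e13, 1e14]

curve = risk_compute_curve(costs, budgets)
print(curve)  # [0.25, 0.5, 0.75]
```

Plotting such a curve against a log-scaled FLOPs axis allows attacks with very different per-query costs to be compared on the same footing, which ASR at a fixed query budget does not.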

Problem

Research questions and friction points this paper is trying to address.

adversarial robustness

compute-aware evaluation

attack cost

language models

jailbreak

Innovation

Methods, ideas, or system contributions that make the work stand out.

compute-aware evaluation

adversarial robustness

risk-compute curves