🤖 AI Summary
This work addresses the vulnerability of vision-language models such as CLIP to adversarial attacks, where existing test-time defenses often degrade clean accuracy. The authors propose a training-free, plug-and-play gating mechanism that uses feature drift in CLIP's representation space under high-magnitude perturbations as a lightweight signal for adversarial detection: beyond the weak-noise regime, adversarial representations become markedly more unstable than clean ones. Existing defenses, such as counterattack or noise anchoring, are activated only when this adversarial-like instability is detected. Grounded in an analysis of feature drift under Gaussian and uniform noise and under photometric and geometric transformations, the method improves the trade-off between clean and robust accuracy across 13 datasets. On eight fine-grained datasets, mean clean+adversarial accuracy rises from 65.7% to 71.4% for counterattack defenses and from 68.4% to 73.2% for noise anchoring; on ImageNet and four distribution-shifted variants, it rises from 56.1% to 66.2% and from 62.1% to 67.6%, respectively.
📝 Abstract
Vision-language models (VLMs) such as CLIP show strong zero-shot generalization but remain highly vulnerable to adversarial attacks. Adversarial training improves robustness but is computationally expensive, motivating test-time defenses. Recent approaches exploit how CLIP's visual representations respond to stochastic perturbations: aggregating predictions across noisy views, constructing Gaussian noise-averaged anchors and interpolating features toward them, or applying counter-perturbations. These strategies improve robustness but often degrade clean accuracy, yielding an unfavorable clean-robust trade-off. We revisit stochastic test-time defenses and identify an underexplored noise-regime transition in CLIP's representation space. Prior work explored perturbations mainly in the weak-noise regime, where adversarial examples can appear unusually stable (false stability). Our analysis shows this reverses as perturbation strength grows: beyond the weak-noise regime, adversarial representations become markedly more unstable than clean ones, giving a clearer separation signal. The transition is consistent across uniform and Gaussian noise, photometric and geometric transforms, datasets, and diverse attacks. It largely disappears in adversarially trained models, suggesting it is tied to the fragile local-basin geometry of adversarial representations in non-robust CLIP. We propose a training-free, plug-in drift-gated mechanism that uses high-noise feature drift as a lightweight gating signal to trigger existing test-time defenses only when adversarial-like instability is detected. Across 13 datasets it consistently improves the clean-robust trade-off. On eight fine-grained datasets, mean clean+adversarial accuracy rises from 65.7% to 71.4% for counterattack defenses and 68.4% to 73.2% for noise-anchoring; on ImageNet and four shifted variants, from 56.1% to 66.2% and 62.1% to 67.6%.
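The drift-gated idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the drift measure (mean cosine distance between the clean-view feature and high-noise-view features), the Gaussian noise scale `sigma`, the number of views `n_views`, the `threshold`, and the function names `feature_drift` and `drift_gated_features` are all assumptions chosen for illustration. `encode` stands in for a CLIP image encoder and `defend` for any existing test-time defense.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def feature_drift(encode, image, sigma=0.5, n_views=8, rng=None):
    """Mean cosine distance between the feature of `image` and the
    features of `n_views` Gaussian-noised copies (assumed drift measure).

    `encode` maps an image array in [0, 1] to a 1-D feature vector.
    `sigma` is a hypothetical "high-noise" scale, not a value from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    base = l2_normalize(encode(image))
    noise = rng.normal(0.0, sigma, size=(n_views,) + image.shape)
    views = np.clip(image[None] + noise, 0.0, 1.0)
    feats = l2_normalize(np.stack([encode(v) for v in views]))
    # Cosine similarity of each noisy view with the unperturbed feature.
    return float(np.mean(1.0 - feats @ base))


def drift_gated_features(encode, defend, image, threshold, **drift_kwargs):
    """Run `defend` only when the drift exceeds `threshold`.

    Returns (features, gated), where `gated` is True if the defense ran.
    Inputs with low drift bypass the defense and keep their plain features.
    """
    drift = feature_drift(encode, image, **drift_kwargs)
    if drift > threshold:
        return defend(image), True
    return encode(image), False
```

In practice the threshold would presumably need to be calibrated, for example on held-out clean images, so that clean inputs rarely trigger the defense; how the paper sets it is not stated in the abstract.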