When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness

📅 2026-06-05

🤖 AI Summary

This work addresses the limited robustness of vision-language models such as CLIP under strong adversarial attacks at inference time by proposing Multi-view guided Adaptive Counterattack (MAC). MAC constructs multiple augmented views of the input image, performs counterattacks in the embedding space, and adaptively scales the counterattack strength for each view according to its estimated corruption level. The final prediction is obtained through soft-weighted fusion of the multi-view outputs. With a tuning-free design, MAC substantially improves robustness across 20 datasets and diverse attack scenarios while maintaining high inference speed and memory efficiency, overcoming the limitations of the prior test-time counterattack approach, which relies on a single view and a hard-gated counterattack.

📝 Abstract

Vision-language models such as CLIP have achieved remarkable zero-shot recognition capabilities, yet their robustness against adversarial perturbations remains limited. Test-time counterattack (TTC) was recently proposed to improve CLIP's robustness by perturbing an input image to steer it away from a corrupted state during inference. However, TTC remains fragile under strong attacks because its counterattack relies on a directly corrupted original view and employs a noise-driven hard-gating scheme that cannot adapt to varying corruption severity. To address these limitations, we introduce Multi-view guided Adaptive Counterattack (MAC), which performs counterattacks across multiple views with corruption-aware soft weighting. Specifically, MAC first constructs augmented views of an input image to obtain diverse embeddings. It then performs counterattacks to refine the corrupted embeddings of these views. Next, MAC adaptively scales the counterattack intensity for each view based on its estimated corruption degree. Finally, the adaptively counterattacked views are aggregated to yield a robust final prediction. Extensive experiments across 20 datasets and diverse attack scenarios demonstrate that MAC substantially improves robustness while preserving high inference speed and memory efficiency with its tuning-free design. Our code is available at https://github.com/sunoh-kim/MAC.
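The difference between hard gating and corruption-aware soft weighting can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`hard_gate`, `scale_counterattack`, `fuse_views`), the assumption that corruption scores lie in [0, 1], the choice to strengthen the counterattack for more corrupted views, and the softmax-based weighting that favors less corrupted views are all hypothetical choices made for illustration. The abstract does not specify how MAC estimates corruption or computes its weights.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_gate(base_budget, corruption_scores, threshold=0.5):
    """Hard gating (illustrative): full counterattack or none at all."""
    return [base_budget if c >= threshold else 0.0 for c in corruption_scores]


def scale_counterattack(base_budget, corruption_scores):
    """Soft scaling (assumed form): budget grows with estimated corruption.

    Scores are clipped to [0, 1]; this range is an assumption of the sketch.
    """
    return [base_budget * min(max(c, 0.0), 1.0) for c in corruption_scores]


def fuse_views(view_probs, corruption_scores, temperature=1.0):
    """Soft-weighted fusion of per-view class probabilities.

    Assumption: views with lower estimated corruption receive larger weights.
    Returns the fused class distribution and the per-view weights.
    """
    weights = softmax([-c for c in corruption_scores], temperature)
    num_classes = len(view_probs[0])
    fused = [
        sum(w * probs[k] for w, probs in zip(weights, view_probs))
        for k in range(num_classes)
    ]
    return fused, weights


if __name__ == "__main__":
    # Two hypothetical views: the first is barely corrupted, the second heavily.
    view_probs = [[0.9, 0.1], [0.2, 0.8]]
    corruption = [0.1, 0.9]
    print(hard_gate(4.0, corruption))
    print(scale_counterattack(4.0, corruption))
    print(fuse_views(view_probs, corruption))
```

In this sketch the hard gate treats every view above the threshold identically, whereas the soft variant lets the counterattack budget and the fusion weight vary continuously with the estimated corruption, which is the property the abstract attributes to MAC.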

Problem

Research questions and friction points this paper is trying to address.

adversarial robustness

test-time counterattack

vision-language models

CLIP

multi-view

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view

adaptive counterattack

test-time robustness