🤖 AI Summary
This work addresses two problems: vision-language models remain fragile under adversarial perturbations, and existing test-time adaptation defenses are computationally expensive because they rely on many augmented views. The authors propose SS-TPT, which scores each augmented view on two criteria: stability, defined as prediction invariance under weak augmentations, and suitability, measured by feature-space density among views. These scores guide both prompt tuning and a weighted prediction, so that trustworthy views are amplified and corrupted ones suppressed, improving robustness while keeping inference overhead low. Experiments reported by the authors show that SS-TPT achieves better robustness-throughput trade-offs than prior methods across multiple datasets and varying numbers of views.
📝 Abstract
Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but remain highly fragile under adversarial perturbations. Recent test-time adaptation defenses improve robustness by leveraging many augmented views, but this leads to impractical slowdown and a clear robustness-throughput trade-off. To address this challenge, we present Stability and Suitability-guided Test-time Prompt Tuning (SS-TPT), which evaluates the quality of each augmented view via two complementary scores: (1) stability, measuring prediction invariance to weak augmentations, and (2) suitability, measuring feature-space density among views. These stability and suitability (SS) scores guide both adaptation and inference through an SS-guided consistency loss and an SS-weighted prediction, amplifying trustworthy views while suppressing corrupted ones. Extensive experiments show that SS-TPT significantly outperforms prior state-of-the-art methods, achieving superior robustness-throughput trade-offs across diverse datasets and varying numbers of views, which indicates both strong practicality and generality. Our code is available at https://github.com/sunoh-kim/SS-TPT.
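To make the two scores concrete, here is a minimal sketch of how per-view stability and suitability could be computed and combined into a weighted prediction. It is not the authors' implementation: the abstract does not give the formulas, so every function name and every formula below is an assumption chosen for illustration (stability as top-1 agreement with weak augmentations, suitability as mean cosine similarity to the other views, weights as the normalized product of the two). The official code is in the repository linked above.

```python
import math


def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=values.__getitem__)


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def stability_scores(view_probs, weak_probs):
    """ASSUMED definition: for each view, the fraction of its weakly
    augmented copies whose top-1 class matches the view's top-1 class."""
    scores = []
    for probs, weak in zip(view_probs, weak_probs):
        top = argmax(probs)
        agree = sum(1 for q in weak if argmax(q) == top)
        scores.append(agree / len(weak))
    return scores


def suitability_scores(features):
    """ASSUMED density proxy: mean cosine similarity of each view's
    feature to the features of all other views (needs >= 2 views)."""
    n = len(features)
    scores = []
    for i in range(n):
        sims = [cosine(features[i], features[j]) for j in range(n) if j != i]
        scores.append(sum(sims) / len(sims))
    return scores


def ss_weighted_prediction(view_probs, stability, suitability):
    """ASSUMED combination rule: weight each view by the product of its
    (non-negative) scores, normalize, and average the class probabilities."""
    raw = [max(s, 0.0) * max(d, 0.0) for s, d in zip(stability, suitability)]
    total = sum(raw)
    if total == 0.0:
        # Fall back to a uniform average when every view scores zero.
        weights = [1.0 / len(raw)] * len(raw)
    else:
        weights = [r / total for r in raw]
    num_classes = len(view_probs[0])
    fused = [
        sum(w * p[c] for w, p in zip(weights, view_probs))
        for c in range(num_classes)
    ]
    return weights, fused


if __name__ == "__main__":
    # Toy data: views 0 and 1 agree; view 2 is an unstable outlier.
    view_probs = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]
    weak_probs = [
        [[0.9, 0.1], [0.6, 0.4]],
        [[0.7, 0.3], [0.8, 0.2]],
        [[0.6, 0.4], [0.2, 0.8]],
    ]
    features = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]]
    stab = stability_scores(view_probs, weak_probs)
    suit = suitability_scores(features)
    weights, fused = ss_weighted_prediction(view_probs, stab, suit)
    print("weights:", weights)
    print("fused prediction:", fused)
```

In this toy setup the outlier view flips its prediction under weak augmentation and sits far from the other views in feature space, so it receives the smallest weight. The same weights could in principle also scale a per-view consistency loss during prompt tuning, though the form of that loss is not specified in the abstract.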