🤖 AI Summary
Existing approaches for adapting vision-language models (VLMs) rely on strong assumptions—such as the presence of all test classes and independent and identically distributed (i.i.d.) data—rendering them fragile under real-world deployment conditions with a variable number of effective classes and non-stationary data streams. To address this, the authors introduce a more realistic evaluation framework and propose StatA, a versatile test-time adaptation (TTA) method for VLMs. StatA incorporates a novel regularization term that acts as a statistical anchor, preserving the initial text-encoder knowledge and limiting drift, particularly in low-data regimes. Comprehensive experiments, comparisons, and ablations show that existing transductive and TTA methods systematically trade away the models' initial zero-shot robustness for gains under favorable assumptions about the test distribution, whereas StatA adapts robustly across scenarios with variable class counts and non-i.i.d. batches. The code is publicly available.
📝 Abstract
The zero-shot capabilities of Vision-Language Models (VLMs) have been widely leveraged to improve predictive performance. However, previous works on transductive or test-time adaptation (TTA) often make strong assumptions about the data distribution, such as the presence of all classes. Our work challenges these favorable deployment scenarios and introduces a more realistic evaluation framework, including: (i) a variable number of effective classes for adaptation within a single batch, and (ii) non-i.i.d. batches of test samples in online adaptation settings. We provide comprehensive evaluations, comparisons, and ablation studies that demonstrate how current transductive or TTA methods for VLMs systematically compromise the models' initial zero-shot robustness across various realistic scenarios, favoring performance gains under advantageous assumptions about the test samples' distributions. Furthermore, we introduce StatA, a versatile method that can handle a wide range of deployment scenarios, including those with a variable number of effective classes at test time. Our approach incorporates a novel regularization term designed specifically for VLMs, which acts as a statistical anchor preserving the initial text-encoder knowledge, particularly in low-data regimes. Code available at https://github.com/MaxZanella/StatA.
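To make the "statistical anchor" idea concrete, here is a minimal, hypothetical sketch of an anchoring regularizer. It is *not* the paper's actual objective (see the repository linked above for that); it only illustrates the general principle the abstract describes: penalizing drift of per-class feature statistics away from the zero-shot text embeddings, with a stronger pull when few test samples support a class (the low-data regime). The function name, the count-based weighting `tau / (tau + counts)`, and the squared-distance penalty are all assumptions made for illustration.

```python
import numpy as np

def statistical_anchor_penalty(class_means, text_anchors, counts, tau=1.0):
    """Illustrative anchoring regularizer (assumed form, not the paper's).

    class_means:  (K, D) per-class feature means estimated from test samples
    text_anchors: (K, D) initial zero-shot text embeddings, one per class
    counts:       (K,)   number of test samples assigned to each class
    tau:          pseudo-count controlling how fast anchoring relaxes
    """
    class_means = np.asarray(class_means, dtype=float)
    text_anchors = np.asarray(text_anchors, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Weight decreases as more samples support a class: scarce classes
    # stay anchored to the text encoder, well-populated ones may drift.
    weights = tau / (tau + counts)
    # Squared Euclidean drift of each class mean from its text anchor.
    sq_dist = np.sum((class_means - text_anchors) ** 2, axis=1)
    return float(np.sum(weights * sq_dist))
```

Under this sketch, a class observed only once is pulled back toward its text embedding roughly `(tau + n) / (tau + 1)` times more strongly than one observed `n` times, which mirrors the abstract's point that the anchor matters most when data are scarce.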