🤖 AI Summary
Existing test-time adaptation (TTA) methods typically adapt on only a small fraction of test samples, often those with low-entropy predictions, and so fail to fully exploit the information available in the test distribution. This work proposes DualTTA, a framework that uses prediction stability under semantic-preserving and semantic-altering transformations as a reliability criterion, addressing the limitations of purely entropy-based sample selection. After separating reliable from unreliable samples, DualTTA applies two complementary objectives: it minimizes prediction entropy for reliable samples and maximizes it for unreliable ones, with the grouping mechanism motivated by theoretical analysis. Extensive experiments demonstrate that the proposed method significantly improves model performance across diverse distribution shift scenarios, achieving both strong effectiveness and robustness.
📝 Abstract
Conventional test-time adaptation (TTA) approaches typically adapt the model using only a small fraction of test samples, often those with low-entropy predictions, thereby failing to fully leverage the available information in the test distribution. This paper introduces DualTTA, a novel framework that improves performance under distribution shifts by utilizing a larger and more diverse set of test samples. DualTTA identifies two distinct groups: one where the model's predictions are likely consistent with the underlying semantics, and another where predictions are likely incorrect. For the first group, it minimizes prediction entropy to reinforce reliable decisions; for the second, it maximizes entropy to suppress overconfident errors and unlearn spurious behavior. These groups are adaptively selected using a new reliability criterion that measures prediction stability under both semantic-preserving and semantic-altering transformations, addressing the limitations of purely entropy-based selection. We further provide theoretical analysis and empirical justification showing that our approach yields a tighter separation between reliable and unreliable samples, judged by their suitability for adaptation, leading to provably more effective model updates.
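The dual objective described above can be illustrated with a small, dependency-free sketch. It is not the paper's implementation: the abstract does not specify the reliability criterion, so the grouping rule in `assign_group` (argmax agreement across views), the third "skip" outcome, and the `weight_unreliable` parameter are all assumptions made for illustration. Each sample is represented by three logit vectors: the original input, a semantic-preserving view, and a semantic-altering view.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def assign_group(logits_orig, logits_preserving, logits_altering):
    """Hypothetical grouping rule, assumed for illustration only.

    reliable:   the prediction survives the semantic-preserving view
                and changes under the semantic-altering view.
    unreliable: the prediction flips under the semantic-preserving view
                and persists under the semantic-altering view.
    skip:       anything else is left out of the update.
    """
    pred = argmax(logits_orig)
    stable = argmax(logits_preserving) == pred
    sensitive = argmax(logits_altering) != pred
    if stable and sensitive:
        return "reliable"
    if not stable and not sensitive:
        return "unreliable"
    return "skip"


def dual_entropy_loss(batch, weight_unreliable=1.0):
    """Mean entropy of reliable samples minus weighted mean entropy of
    unreliable samples. Minimizing this value lowers entropy on the
    first group and raises it on the second.

    `batch` is a list of (logits_orig, logits_preserving, logits_altering).
    `weight_unreliable` is an assumed trade-off parameter.
    """
    reliable, unreliable = [], []
    for logits_orig, logits_pres, logits_alt in batch:
        group = assign_group(logits_orig, logits_pres, logits_alt)
        h = entropy(softmax(logits_orig))
        if group == "reliable":
            reliable.append(h)
        elif group == "unreliable":
            unreliable.append(h)
    loss = 0.0
    if reliable:
        loss += sum(reliable) / len(reliable)
    if unreliable:
        loss -= weight_unreliable * sum(unreliable) / len(unreliable)
    return loss


if __name__ == "__main__":
    batch = [
        ([4.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 2.0, 0.0]),  # reliable
        ([0.0, 5.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0]),  # unreliable
        ([4.0, 0.0, 0.0], [3.0, 0.0, 0.0], [2.0, 0.0, 0.0]),  # skip
    ]
    print(dual_entropy_loss(batch))
```

In a real TTA loop the loss would be computed on differentiable model outputs and backpropagated through the adapted parameters; the plain-Python version here only shows how the two groups contribute terms of opposite sign.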