🤖 AI Summary
Existing test-time adaptation (TTA) methods often fail, or even degrade model performance, under realistic conditions such as mixed distribution shifts, small batch sizes, and online label imbalance. To address these challenges, we propose a robust TTA framework with three core contributions: (1) replacing batch normalization with batch-agnostic group/layer normalization, which substantially stabilizes adaptation; (2) sharpness-aware, reliable entropy minimization (SAR), which filters out noisy samples with large gradients and drives model weights toward flat minima; and (3) two regularizers (SAR^2) that suppress inter-dimensional feature redundancy and class-biased representations to prevent model collapse. Extensive experiments demonstrate that our method performs more stably than state-of-the-art TTA approaches across diverse and challenging wild test-time scenarios while remaining computationally efficient at inference.
📝 Abstract
Test-time adaptation (TTA) may fail to improve, or may even harm, model performance when test data exhibit: 1) mixed distribution shifts, 2) small batch sizes, or 3) online imbalanced label distribution shifts. This is often a key obstacle preventing existing TTA methods from being deployed in the real world. In this paper, we investigate the reasons for this instability and find that the batch norm layer is a crucial factor hindering TTA stability. Conversely, TTA can perform more stably with batch-agnostic norm layers, i.e., group or layer norm. However, we observe that TTA with group and layer norms does not always succeed and still suffers from failure cases, i.e., the model collapses into trivial solutions by assigning the same class label to all samples. Digging into this, we find that, during the collapse process: 1) the model gradients often undergo an initial explosion followed by rapid degradation, suggesting that certain noisy test samples with large gradients may disrupt adaptation; and 2) the model representations tend to exhibit high correlations and classification bias. To address this, we first propose a sharpness-aware and reliable entropy minimization method, called SAR, which stabilizes TTA from two aspects: 1) removing partially noisy samples with large gradients, and 2) encouraging model weights to move toward a flat minimum so that the model is robust to the remaining noisy samples. Based on SAR, we further introduce SAR^2 to prevent representation collapse with two regularizers: 1) a redundancy regularizer to reduce inter-dimensional correlations among centroid-invariant features; and 2) an inequity regularizer to maximize the prediction entropy of a prototype centroid, thereby penalizing representations biased toward any specific class. Promising results demonstrate that our methods perform more stably than prior methods and are computationally efficient under the above wild test scenarios.
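The reliable-sample selection described above can be sketched as a simple entropy filter: samples whose predictive entropy exceeds a threshold proportional to ln(C), where C is the number of classes, are excluded from the adaptation update because they tend to produce large, noisy gradients. This is a minimal stdlib-only sketch; the function names and the 0.4 margin fraction are illustrative assumptions, not the paper's exact implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    """Shannon entropy of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_reliable(batch_logits, num_classes, margin_frac=0.4):
    """Return indices of samples whose prediction entropy falls below
    E0 = margin_frac * ln(num_classes). High-entropy (unconfident)
    samples are dropped, since their gradients can destabilize
    entropy-minimization-based adaptation."""
    e0 = margin_frac * math.log(num_classes)
    return [i for i, logits in enumerate(batch_logits)
            if entropy(softmax(logits)) < e0]

# A confident prediction passes the filter; a near-uniform one is dropped.
batch = [[5.0, 0.0, 0.0],    # confident -> low entropy, kept
         [0.1, 0.0, 0.05]]   # near-uniform -> entropy ~ ln(3), dropped
print(select_reliable(batch, num_classes=3))  # → [0]
```

In the full method, the entropy loss computed on the retained samples is then minimized with a sharpness-aware optimizer (in the spirit of SAM) so that the adapted weights sit in a flat region of the loss surface.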