🤖 AI Summary
This work addresses a limitation of existing test-time adaptation (TTA) methods: they typically assume a single global domain distribution and overlook sample-level, multi-dimensional domain shifts, which makes adaptation fragile. To overcome this, the authors propose DOME, a framework that, for the first time, explicitly models continuous sample-level domain variables in a zero-shot manner. DOME leverages vision-language pretraining to extract dense domain representations, uses a momentum-updated sparse domain bank to provide disentangled supervision, and injects the resulting explicit domain information into the downstream model. By replacing the implicit global-domain assumption with a structured domain representation, DOME achieves state-of-the-art performance on ImageNet-C, ImageNet-R, and ImageNet-Sketch, outperforming existing TTA approaches (including more complex ones) and demonstrating the efficacy and robustness of explicit domain modeling.
📝 Abstract
Test-time adaptation (TTA) aims to adapt a model to shifting test domains using only unlabeled streaming data. Most existing methods implicitly infer a single global domain distribution and ignore the multidimensional, sample-specific nature of real-world domain shifts, which makes adaptation fragile. We propose DOME, an effective domain encoder that explicitly models each sample's domain in a zero-shot manner. DOME leverages vision-language pretraining to extract dense, continuous representations, parameterizes domains as distributional variables, and introduces a momentum-updated sparse domain bank for disentangled supervision. When these explicit domain cues are injected into downstream models, even a basic entropy-minimization TTA strategy achieves state-of-the-art performance across ImageNet-C, ImageNet-R, and ImageNet-Sketch, outperforming complex TTA approaches. Our results demonstrate that robust adaptation stems not from intricate adaptation algorithms, but from explicit, structured domain representation.
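For readers unfamiliar with the "basic entropy-minimization TTA strategy" mentioned above, the following is a minimal sketch of the objective such a strategy typically minimizes on unlabeled test batches: the mean Shannon entropy of the model's softmax predictions. It is an illustration only, not DOME's implementation. The function names (`softmax`, `prediction_entropy`, `batch_entropy_loss`) are hypothetical, and the abstract does not say which parameters are updated or how the domain cues enter the loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax prediction for one sample."""
    probs = softmax(logits)
    # Skip zero probabilities: p * log(p) tends to 0 as p tends to 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def batch_entropy_loss(batch_logits):
    """Mean prediction entropy over an unlabeled batch.

    A generic entropy-minimization TTA step would take a gradient of this
    quantity with respect to a subset of model parameters (an assumption;
    the abstract does not specify which parameters are adapted).
    """
    return sum(prediction_entropy(l) for l in batch_logits) / len(batch_logits)

# Hypothetical logits: a confident prediction has lower entropy than an
# uncertain one, so minimizing the loss pushes predictions to be confident.
confident = [8.0, 0.0, 0.0, 0.0]
uncertain = [0.0, 0.0, 0.0, 0.0]
print(prediction_entropy(confident) < prediction_entropy(uncertain))
```

Because no labels are needed, this loss can be computed on streaming test data as it arrives, which is why it is a common baseline for TTA.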