🤖 AI Summary
This work examines a limitation of open-set test-time adaptation (TTA): existing methods such as SAR, OSTTA, UniEnt, and SoTTA improve in-distribution (InD) accuracy, but they struggle to detect out-of-distribution (OOD) samples, and their own OOD filtering only imperfectly screens OOD data from adaptation updates. The study benchmarks these methods on CIFAR-10-C and ImageNet-C with corrupted OOD datasets (SVHN-C and CIFAR-100-C for CIFAR-10-C; ImageNet-O-C and Textures-C for ImageNet-C) and finds that current methods struggle to balance InD and OOD accuracy. It also examines more realistic settings in which the proportion and rate of OOD data vary. To explore the trade-off between InD recognition and OOD rejection, the authors propose a new baseline that replaces softmax (multi-class) output with sigmoid (multi-label) output.
📝 Abstract
Open-set test-time adaptation (TTA) updates models on new data in the presence of input shifts and unknown output classes. While recent methods have made progress on improving in-distribution (InD) accuracy for known classes, their ability to accurately detect out-of-distribution (OOD) unknown classes remains underexplored. We benchmark robust and open-set TTA methods (SAR, OSTTA, UniEnt, and SoTTA) on the standard corruption benchmarks of CIFAR-10-C at the small scale and ImageNet-C at the large scale. For CIFAR-10-C, we use OOD data from SVHN and CIFAR-100 in their respective corrupted forms of SVHN-C and CIFAR-100-C. For ImageNet-C, we use OOD data from ImageNet-O and Textures in their respective corrupted forms of ImageNet-O-C and Textures-C. ImageNet-O is nearer to ImageNet, as unknown but related object classes (like "garlic bread" vs. "hot dog" for food, or "highway" vs. "dam" for infrastructure), while Textures is farther from ImageNet, as non-object patterns (like "cracked" mud, "porous" sponge, "veined" leaves). We evaluate the accuracy and confidence of TTA methods for InD vs. OOD recognition on CIFAR-10-C and ImageNet-C. We verify the accuracy of each method's own OOD detection technique on CIFAR-10-C. We also evaluate on ImageNet-C and report both accuracy and standard OOD detection metrics. We further examine more realistic settings, in which the proportions and rates of OOD data can vary. To explore the trade-off between InD recognition and OOD rejection, we propose a new baseline that replaces softmax/multi-class output with sigmoid/multi-label output. Our analysis shows for the first time that current open-set TTA methods struggle to balance InD and OOD accuracy and that they only imperfectly filter OOD data for their own adaptation updates.
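To make the softmax vs. sigmoid distinction concrete, the following is a minimal sketch of why a multi-label output can help with OOD rejection. It is not the authors' implementation: the rejection rule (reject when the highest per-class sigmoid score falls below a threshold), the threshold value of 0.5, the helper name `predict_open_set`, and the example logits are all assumptions chosen for illustration. The point it shows is that softmax scores must sum to one, so some class always receives substantial probability, whereas independent sigmoid scores can all be low at once.

```python
import math


def softmax(logits):
    """Multi-class output: scores are coupled and always sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Multi-label output: each class is scored independently in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_open_set(logits, threshold=0.5):
    """Hypothetical rule: return (class_index, score), or (None, score)
    when no class clears the threshold and the input is treated as OOD."""
    scores = [sigmoid(z) for z in logits]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]


# Illustrative logits only (not from the paper).
ind_logits = [4.0, -3.0, -2.5]   # one class clearly active
ood_logits = [-2.0, -3.0, -2.5]  # no class active

# Softmax still assigns a majority of the mass to one class for the OOD input.
print("softmax max on OOD input:", max(softmax(ood_logits)))
# Sigmoid scores stay low for every class, so the input can be rejected.
print("open-set prediction (InD):", predict_open_set(ind_logits))
print("open-set prediction (OOD):", predict_open_set(ood_logits))
```

In this sketch the threshold directly controls the trade-off the abstract describes: raising it rejects more OOD inputs at the cost of also rejecting more InD inputs, and lowering it does the reverse.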