🤖 AI Summary
Existing equivariant learning methods struggle to model image invariance and equivariance under unseen transformations, hindering tasks that require transformation-sensitive information, such as object localization and fine-grained classification. To address this, we propose Self-supervised Transformation Learning (STL), a framework that estimates transformation representations implicitly from image pairs without manual transformation labels, removing the dependence on predefined transformations and supporting complex augmentations (e.g., AugMix). STL integrates equivariant representation modeling with contrastive learning to jointly optimize invariance and equivariance. Evaluated on 11 benchmarks, STL surpasses state-of-the-art methods on 7 tasks, with particularly notable gains in detection. It is compatible with mainstream backbone architectures, and its implementation is publicly available.
📝 Abstract
Unsupervised representation learning has significantly advanced various machine learning tasks. In the computer vision domain, state-of-the-art approaches use transformations such as random crop and color jitter to achieve invariant representations, embedding semantically identical inputs close together despite transformations. However, this can degrade performance in tasks that require precise features, such as localization or flower classification. To address this, recent research incorporates equivariant representation learning, which captures transformation-sensitive information. However, current methods depend on transformation labels and thus struggle with interdependent and complex transformations. We propose Self-supervised Transformation Learning (STL), which replaces transformation labels with transformation representations derived from image pairs. The proposed method ensures that the transformation representation is image-invariant and learns the corresponding equivariant transformations, enhancing performance without increased batch complexity. We demonstrate the approach's effectiveness across diverse classification and detection tasks, outperforming existing methods in 7 out of 11 benchmarks and excelling in detection. By incorporating complex transformations such as AugMix, which prior equivariant methods could not use, the approach improves performance across tasks, underscoring its adaptability and resilience. Its compatibility with various base models further highlights its flexibility and broad applicability. The code is available at https://github.com/jaemyung-u/stl.
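To make the two properties concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of what the abstract describes: a transformation representation estimated from an (image, transformed image) pair should be *image-invariant* (the same transformation yields the same representation regardless of the image it was applied to), and applying that representation in embedding space should predict the transformed embedding (*equivariance*). The linear encoder, additive "augmentation", difference-based transformation representation, and MSE losses below are all toy assumptions standing in for the learned networks and losses of the actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy image encoder: a linear map standing in for a deep backbone."""
    return W @ x

def transformation_rep(z_orig, z_aug):
    """Hypothetical transformation representation estimated from an
    (original, transformed) embedding pair; here simply the embedding
    difference, whereas the real method learns it with a network."""
    return z_aug - z_orig

def stl_style_losses(x1, x2, t, W):
    """Toy invariance/equivariance losses for two images x1, x2
    under the same transformation t (no transformation label used)."""
    z1, z2 = encode(x1, W), encode(x2, W)
    z1t, z2t = encode(t(x1), W), encode(t(x2), W)
    r1 = transformation_rep(z1, z1t)
    r2 = transformation_rep(z2, z2t)
    # Image-invariance: the same transformation should map to the same
    # representation no matter which image it was applied to.
    inv_loss = np.mean((r1 - r2) ** 2)
    # Equivariance: applying the shared transformation representation in
    # embedding space should predict the transformed embedding.
    r = 0.5 * (r1 + r2)
    equi_loss = np.mean((z1 + r - z1t) ** 2) + np.mean((z2 + r - z2t) ** 2)
    return inv_loss, equi_loss

# Usage: random "images" and a brightness-like additive shift.
W = rng.normal(size=(8, 16))
x1, x2 = rng.normal(size=16), rng.normal(size=16)
t = lambda x: x + 0.3  # uniform shift standing in for an augmentation
inv_loss, equi_loss = stl_style_losses(x1, x2, t, W)
```

With this linear toy encoder and an additive shift, both losses vanish up to floating-point error, illustrating the fixed point that the actual method optimizes toward with deep networks and learned transformation representations.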