AI Summary
This work addresses the joint modeling of semantic and geometric information in pixel-level representation learning, with the goal of precise cross-image point correspondence without momentum-based mechanisms. It proposes a family of stable contrastive losses that removes the need for the conventional momentum teacher-student architecture. The method maps each pixel to an overcomplete descriptor and trains these descriptors with pixel-wise contrastive learning, self-supervised, in synthetic 2D and 3D environments. Two experiments in these environments illustrate the properties of the loss and indicate that the learned descriptors are both view-invariant and semantically meaningful, which supports point matching across views.
Abstract
We pilot a family of stable contrastive losses for learning pixel-level representations that jointly capture semantic and geometric information. Our approach maps each pixel of an image to an overcomplete descriptor that is both view-invariant and semantically meaningful. It enables precise point-correspondence across images without requiring momentum-based teacher-student training. Two experiments in synthetic 2D and 3D environments demonstrate the properties of our loss and the resulting overcomplete representations.
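To make the idea of pixel-wise contrastive learning and descriptor-based point correspondence concrete, here is a minimal sketch in Python. It is not the stable loss proposed in the paper, whose form the abstract does not give; it swaps in a generic InfoNCE objective over matched pixel descriptors. The function names (`pixel_info_nce`, `match_pixels`), the temperature value, and the descriptor dimension are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each descriptor (row) to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def pixel_info_nce(desc_a, desc_b, temperature=0.1):
    """Generic InfoNCE loss over pixel descriptors (an assumption, not the paper's loss).

    desc_a, desc_b: arrays of shape (N, D). Row i of each array is assumed to
    describe the same scene point seen in two different views.
    """
    a = l2_normalize(desc_a)
    b = l2_normalize(desc_b)
    logits = a @ b.T / temperature                        # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for pixel i in view A is pixel i in view B (the diagonal).
    return float(-np.mean(np.diag(log_prob)))


def match_pixels(desc_a, desc_b):
    """For each descriptor in view A, return the index of its nearest neighbor in view B."""
    return np.argmax(l2_normalize(desc_a) @ l2_normalize(desc_b).T, axis=1)


rng = np.random.default_rng(0)
view_a = rng.normal(size=(16, 32))                    # 16 pixels, hypothetical 32-dim descriptors
view_b = view_a + 0.01 * rng.normal(size=(16, 32))    # same points, slightly perturbed
unrelated = rng.normal(size=(16, 32))                 # descriptors with no correspondence

loss_aligned = pixel_info_nce(view_a, view_b)
loss_unrelated = pixel_info_nce(view_a, unrelated)
matches = match_pixels(view_a, view_b)
```

In this toy setup the loss is lower when corresponding pixels have similar descriptors than when the second view is unrelated, and nearest-neighbor search over the descriptors recovers the point correspondence.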