Pair-VPR: Place-Aware Pre-training and Contrastive Pair Classification for Visual Place Recognition with Vision Transformers

📅 2024-10-09
🏛️ IEEE Robotics and Automation Letters
🤖 AI Summary
To address the limited discriminability of global descriptors and the suboptimal re-ranking of existing visual place recognition (VPR) pipelines, this paper proposes the first end-to-end joint learning framework that simultaneously optimizes a ViT-based global descriptor and a pair classifier. The authors introduce a place-aware Siamese masked image modelling pre-training strategy, replacing generic ImageNet-style initialization with VPR-specific feature learning, and adopt a class-token-driven dual-branch architecture that unifies descriptor extraction and image-pair similarity classification. Evaluated on five standard VPR benchmarks, the method achieves state-of-the-art performance: the ViT-B encoder surpasses most CNN-based approaches, while larger encoders (e.g., ViT-L) further improve recall under challenging conditions such as viewpoint and illumination changes. These results validate the effectiveness of joint descriptor-classifier optimization and domain-adaptive pre-training for VPR.

📝 Abstract
In this work we propose a novel joint training method for Visual Place Recognition (VPR), which simultaneously learns a global descriptor and a pair classifier for re-ranking. The pair classifier can predict whether a given pair of images is from the same place or not. The network comprises only Vision Transformer components for both the encoder and the pair classifier, and both components are trained using their respective class tokens. In existing VPR methods, the network is typically initialized using pre-trained weights from a generic image dataset such as ImageNet. In this work we propose an alternative pre-training strategy, using Siamese Masked Image Modelling as the pre-training task. We propose a place-aware image sampling procedure over a collection of large VPR datasets for pre-training our model, to learn visual features tuned specifically for VPR. By re-using the Masked Image Modelling encoder and decoder weights in the second stage of training, Pair-VPR can achieve state-of-the-art VPR performance across five benchmark datasets with a ViT-B encoder, along with further improvements in localization recall with larger encoders. The Pair-VPR website is: https://csiro-robotics.github.io/Pair-VPR.
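The abstract describes a two-stage pipeline: candidates are first retrieved with the global descriptor, then re-ranked by the pair classifier's same-place prediction. A minimal sketch of that retrieve-then-rerank flow, with the ViT pair classifier stubbed out as a plain scoring callable (all names here are illustrative, not the paper's API):

```python
import numpy as np

def retrieve_and_rerank(query_desc, db_descs, pair_score, top_k=3):
    """Two-stage VPR sketch.

    Stage 1: rank the database by global-descriptor similarity.
    Stage 2: re-rank the top-k shortlist with a pair classifier.

    query_desc: (D,) L2-normalised global descriptor of the query image.
    db_descs:   (N, D) L2-normalised database descriptors.
    pair_score: callable mapping a database index to a same-place score;
                stands in for the ViT pair classifier (hypothetical stub).
    """
    # Stage 1: cosine similarity (dot product of unit vectors).
    sims = db_descs @ query_desc
    candidates = np.argsort(-sims)[:top_k]
    # Stage 2: re-order the shortlist by pair-classifier score.
    reranked = sorted(candidates, key=lambda i: -pair_score(i))
    return [int(i) for i in reranked]
```

For example, a candidate ranked second by global similarity can be promoted to first if the pair classifier judges it a more likely same-place match, which is exactly the role re-ranking plays in the paper's pipeline.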
Problem

Research questions and friction points this paper is trying to address.

Global descriptors alone have limited discriminability for distinguishing visually similar places.
Networks initialized from generic datasets such as ImageNet learn features not tuned for VPR.
Existing re-ranking stages deliver suboptimal localization recall.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint training with Vision Transformers for VPR
Siamese Masked Image Modelling pre-training
Place-aware image sampling for feature learning
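The place-aware sampling idea is that Siamese Masked Image Modelling needs two views of the same place, so pre-training pairs are drawn from images captured close together rather than at random. A minimal sketch under the assumption that each image has a 2D capture position and that "same place" is approximated by a simple distance threshold (the function name, radius rule, and retry limit are illustrative, not the paper's exact procedure):

```python
import math
import random

def sample_place_pairs(positions, radius=10.0, n_pairs=4, seed=0, max_tries=10000):
    """Draw image-index pairs whose capture positions lie within
    `radius` metres of each other, so both views depict the same place.

    positions: list of (x, y) capture coordinates, one per image.
    Returns up to n_pairs (i, j) index pairs with i != j.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(max_tries):
        if len(pairs) >= n_pairs:
            break
        i, j = rng.sample(range(len(positions)), 2)
        (x1, y1), (x2, y2) = positions[i], positions[j]
        # Keep the pair only if the two images were taken near each other.
        if math.hypot(x1 - x2, y1 - y2) <= radius:
            pairs.append((i, j))
    return pairs
```

Each accepted pair can then be fed to the Siamese setup, with one view masked and reconstructed conditioned on the other, so the encoder learns features specific to recognising places rather than generic imagery.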