🤖 AI Summary
Visual Place Recognition (VPR) commonly relies on local feature re-ranking to improve performance; however, designing task-specific local features is impractical, and motion-sequence constraints hinder generalization. To address this, we propose an Embodiment-constrained Mixture-of-Features (MoF) re-ranking method that fuses multiple pre-trained global features—guided by embodied constraints including GPS priors, temporal continuity, local geometric consistency, and self-similarity—and learns dynamic, input-adaptive weights. We systematically formalize embodied constraints for VPR and introduce a lightweight, learnable weighting mechanism optimized jointly via a multi-metric loss. Leveraging fine-tuned DINOv2 global features, our method achieves a +0.9% improvement over the baseline on Pitts30k, establishing a new state of the art while incurring only 25 KB of additional parameters and 10 μs of per-frame computational overhead.
📝 Abstract
Visual Place Recognition (VPR) is a scene-oriented image retrieval problem in computer vision in which re-ranking based on local features is commonly employed to improve performance. In robotics, VPR is also referred to as Loop Closure Detection, which emphasizes spatio-temporal verification within a sequence. However, designing local features specifically for VPR is impractical, and relying on motion sequences imposes limitations. Motivated by these observations, we propose a novel, simple re-ranking method that refines global features through a Mixture-of-Features (MoF) approach under embodied constraints. First, we analyze the practical feasibility of embodied constraints in VPR and categorize them according to existing datasets, including GPS tags, sequential timestamps, local feature matching, and self-similarity matrices. We then propose a learning-based MoF weight-computation approach trained with a multi-metric loss function. Experiments demonstrate that our method improves state-of-the-art (SOTA) performance on public datasets with minimal additional computational overhead: with only 25 KB of additional parameters and a processing time of 10 microseconds per frame, it achieves a 0.9% improvement over a DINOv2-based baseline on the Pitts30k test set.
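To make the fusion idea concrete, the retrieval-time behavior of a Mixture-of-Features re-ranker can be sketched as below. This is a minimal illustration, not the paper's implementation: the class name `MoFReranker`, the single linear gating layer, and the cosine-similarity re-ranking are all assumptions; in the actual method the gating weights are learned jointly under the embodied constraints (GPS tags, timestamps, local matching, self-similarity) with a multi-metric loss, none of which is modeled here.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

class MoFReranker:
    """Hypothetical sketch: fuse K pre-trained global descriptors of one
    image with input-adaptive weights produced by a tiny gating layer."""

    def __init__(self, num_features, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Gating layer: concatenated descriptors (K*dim) -> K logits.
        # In the paper these parameters would be trained; here they are random.
        self.W = rng.normal(scale=0.01, size=(num_features * dim, num_features))
        self.b = np.zeros(num_features)

    def fuse(self, feats):
        # feats: (K, dim) array of L2-normalized global descriptors.
        logits = feats.reshape(-1) @ self.W + self.b
        w = softmax(logits)                      # dynamic weights, sum to 1
        fused = (w[:, None] * feats).sum(axis=0)
        return fused / np.linalg.norm(fused)     # re-normalize fused descriptor

def rerank(query_feats, candidate_feats_list, mof):
    """Order candidate indices by cosine similarity of fused descriptors."""
    q = mof.fuse(query_feats)
    scores = np.array([q @ mof.fuse(c) for c in candidate_feats_list])
    return np.argsort(scores)[::-1]  # best candidate first
```

Because only a small gating layer is added on top of frozen global features, the parameter and latency overhead stays tiny, which is consistent with the 25 KB / 10 μs figures reported above.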