🤖 AI Summary
To address the limited robustness of visual place recognition (VPR) for long-term localization under dynamic and perceptually ambiguous conditions, this paper proposes OptiCorNet, an end-to-end trainable sequence modeling framework. Unlike prevailing single-frame embedding approaches, the method jointly models spatiotemporal context at the sequence level for descriptor learning: a Differentiable Sequence Delta (DSD) operator captures directional temporal dynamics, and a lightweight 1D convolutional encoder is combined with an LSTM-based refinement module to produce compact, discriminative sequence embeddings. A quadruplet loss further strengthens matching under large viewpoint changes and severe appearance variations. Extensive evaluations on multiple public benchmarks demonstrate significant improvements over state-of-the-art methods, particularly in challenging scenarios involving seasonal transitions and substantial viewpoint shifts, achieving higher accuracy and superior robustness in long-term VPR.
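The core idea of the DSD operator, directional differencing over a feature sequence with a fixed kernel, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names are hypothetical, the mean-pooling aggregation is an illustrative choice, and the LSTM refinement and residual projection described in the abstract are omitted.

```python
import numpy as np

def sequence_delta(seq):
    """Directional temporal differencing over a (T, D) feature sequence.

    Equivalent to convolving each feature channel with a fixed [-1, 1]
    kernel along time, mirroring the fixed-weight differencing kernel
    the DSD module is described as using.
    """
    # d_t = x_{t+1} - x_t, shape (T-1, D)
    return seq[1:] - seq[:-1]

def sequence_descriptor(seq):
    """Pool raw features and their temporal deltas into one embedding.

    Concatenating mean-pooled features and deltas is a stand-in for the
    paper's LSTM-based refinement; it only shows where the directional
    information enters the descriptor.
    """
    delta = sequence_delta(seq)
    desc = np.concatenate([seq.mean(axis=0), delta.mean(axis=0)])
    return desc / (np.linalg.norm(desc) + 1e-12)  # L2-normalise

# Example: a sequence of 5 frames, each a 4-D spatial feature vector
rng = np.random.default_rng(0)
seq = rng.standard_normal((5, 4))
desc = sequence_descriptor(seq)  # one compact sequence-level embedding
```

Because the differencing is a fixed linear operation, gradients flow straight through it, which is what makes the temporal modeling end-to-end trainable rather than a post-processing step.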
📝 Abstract
Visual Place Recognition (VPR) in dynamic and perceptually aliased environments remains a fundamental challenge for long-term localization. Existing deep learning-based solutions predominantly focus on single-frame embeddings, neglecting the temporal coherence present in image sequences. This paper presents OptiCorNet, a novel sequence modeling framework that unifies spatial feature extraction and temporal differencing into a differentiable, end-to-end trainable module. Central to our approach is a lightweight 1D convolutional encoder combined with a learnable differential temporal operator, termed Differentiable Sequence Delta (DSD), which jointly captures short-term spatial context and long-range temporal transitions. The DSD module models directional differences across sequences via a fixed-weight differencing kernel, followed by an LSTM-based refinement and optional residual projection, yielding compact, discriminative descriptors robust to viewpoint and appearance shifts. To further enhance inter-class separability, we incorporate a quadruplet loss that optimizes both positive alignment and multi-negative divergence within each batch. Unlike prior VPR methods that treat temporal aggregation as post-processing, OptiCorNet learns sequence-level embeddings directly, enabling more effective end-to-end place recognition. Comprehensive evaluations on multiple public benchmarks demonstrate that our approach outperforms state-of-the-art baselines under challenging seasonal and viewpoint variations.
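The quadruplet loss mentioned above, which optimizes both positive alignment and multi-negative divergence, can be sketched in the standard form of Chen et al. (CVPR 2017). The margin values and function name below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def quadruplet_loss(anchor, positive, neg1, neg2, margin1=0.5, margin2=0.25):
    """Quadruplet loss on squared L2 distances between embeddings.

    Term 1 (triplet-style): the anchor-positive distance should be
    smaller than the anchor-negative distance by margin1.
    Term 2: the anchor-positive distance should also be smaller than the
    distance between the two negatives by margin2, which pushes
    different negatives apart and improves inter-class separability.
    Margins here are placeholder defaults.
    """
    d = lambda a, b: float(np.sum((a - b) ** 2))  # squared L2 distance
    term1 = max(0.0, d(anchor, positive) - d(anchor, neg1) + margin1)
    term2 = max(0.0, d(anchor, positive) - d(neg1, neg2) + margin2)
    return term1 + term2

# Well-separated quadruplet: loss hits zero once both margins are met
a = np.zeros(3)
p = np.zeros(3)            # positive identical to anchor
n1 = 2.0 * np.ones(3)      # negatives far from anchor and each other
n2 = -2.0 * np.ones(3)
easy = quadruplet_loss(a, p, n1, n2)   # -> 0.0

# Hard negative near the anchor: both hinge terms activate
hard = quadruplet_loss(a, np.ones(3), a + 0.1, a + 0.2)  # > 0
```

In training, the negatives would be mined within each batch, so the second term acts as a batch-level regularizer on the embedding space rather than a per-sample constraint.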