🤖 AI Summary
This work addresses cross-view geo-localization for unmanned aerial vehicles in GNSS-denied environments, where it remains difficult to obtain semantic robustness and fine-grained spatial detail at the same time. The authors propose an efficient, high-precision approach with three components: a LoRA-finetuned DINOv3 backbone that extracts multi-scale features, a novel semantic-gated residual fusion module that bridges the gap between semantic and spatial information, and Mamba-based sequence modeling (used here for the first time in this domain) that captures long-range dependencies with linear computational complexity. The method achieves state-of-the-art performance on both the University-1652 and DenseUAV benchmarks, notably improving Recall@1 by 3.48% on DenseUAV.
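For readers unfamiliar with the Recall@1 metric cited above, the following is a minimal sketch of how Recall@K is commonly computed for retrieval. It assumes a single ground-truth gallery match per query; the function name `recall_at_k` and the toy IDs are hypothetical and are not taken from the paper or from the benchmarks' official evaluation code.

```python
def recall_at_k(ranked_ids, true_ids, k=1):
    """Fraction of queries whose true match appears among the top-k candidates.

    ranked_ids: one list per query, gallery IDs sorted by descending similarity.
    true_ids:   the single ground-truth gallery ID for each query (assumption).
    """
    hits = sum(
        1 for ranked, true in zip(ranked_ids, true_ids) if true in ranked[:k]
    )
    return hits / len(true_ids)


# Hypothetical toy data: three queries ranked against a three-image gallery.
ranked = [["g3", "g1", "g2"], ["g2", "g3", "g1"], ["g1", "g2", "g3"]]
truth = ["g3", "g1", "g1"]

r1 = recall_at_k(ranked, truth, k=1)  # queries 0 and 2 hit at rank 1
r3 = recall_at_k(ranked, truth, k=3)  # every true match is within the top 3
```

Under this definition, Recall@1 counts a query as correct only when the top-ranked gallery image is the true match.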
📝 Abstract
Cross-view geo-localization (CVGL) is critical for Unmanned Aerial Vehicle (UAV) self-positioning and target localization in GNSS-denied environments. However, acquiring robust semantics while preserving fine-grained spatial details remains challenging. To address this, we propose DINO-GFSA, a framework that leverages a DINOv3 (ViT-L) backbone adapted with LoRA (Low-Rank Adaptation) for parameter-efficient, high-capacity representation. Crucially, we introduce a Semantic Gated Residual Fusion module, which uses high-level semantics to selectively calibrate and integrate low-level spatial cues, effectively bridging the semantic gap. Furthermore, a Mamba-based Sequential Aggregation Head is designed to capture long-range spatial dependencies with linear complexity. Experiments demonstrate state-of-the-art performance on the University-1652 and DenseUAV benchmarks, notably surpassing the previous best on DenseUAV by 3.48% on Recall@1. These results validate DINO-GFSA as a generalized, robust solution for UAV CVGL.
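The idea of using high-level semantics to gate low-level spatial cues can be illustrated with a small NumPy sketch. This is one plausible reading of a gated residual fusion, not the paper's actual module: the sigmoid gate, the single linear projection, the parameter names `w_gate` and `b_gate`, and the token shapes are all assumptions made for illustration.

```python
import numpy as np


def semantic_gated_residual_fusion(low, high, w_gate, b_gate):
    """Fuse low-level spatial tokens into high-level semantic tokens.

    low, high: arrays of shape (num_tokens, dim).
    w_gate:    (dim, dim) hypothetical gate projection.
    b_gate:    (dim,) hypothetical gate bias.
    """
    # Gate values in (0, 1), computed from the high-level semantics only.
    gate = 1.0 / (1.0 + np.exp(-(high @ w_gate + b_gate)))
    # Residual path keeps the semantics intact; gated spatial cues are added.
    return high + gate * low


rng = np.random.default_rng(0)
num_tokens, dim = 4, 8
low = rng.standard_normal((num_tokens, dim))
high = rng.standard_normal((num_tokens, dim))
w_gate = rng.standard_normal((dim, dim)) * 0.1
b_gate = np.zeros(dim)

fused = semantic_gated_residual_fusion(low, high, w_gate, b_gate)
```

In this sketch, a gate near 0 leaves the semantic tokens unchanged, while a gate near 1 adds the spatial tokens in full, so the semantics decide per channel how much spatial detail passes through.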