SMGeo: Cross-View Object Geo-Localization with Grid-Level Mixture-of-Experts

📅 2025-11-17

🤖 AI Summary

Cross-view object geo-localization between UAV and satellite imagery suffers from cumulative errors in multi-stage pipelines, caused by large viewpoint and scale discrepancies and by background clutter. To address this, we propose SMGeo, an end-to-end promptable Transformer architecture that enables real-time, click-guided interactive localization. The method has four parts: (1) a Swin-Transformer-based dual-view joint encoder that fuses multi-scale features; (2) a grid-level sparse Mixture-of-Experts (GMoE) module that adaptively models local cross-modal representations; (3) a promptable query mechanism coupled with an anchor-free detection head, which removes the scale bias of predefined anchor boxes; and (4) heatmap-based supervision for precise coordinate regression. On the UAV-to-satellite test set, the method reaches 87.51% localization accuracy at IoU=0.25 and 61.45% mIoU, compared with 61.97% and 54.05% for DetGeo. Ablation studies confirm that the components are complementary.
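The heatmap-based supervision in item (4) can be illustrated with a minimal sketch. The summary does not give the paper's exact target construction, so everything here is an assumption made for illustration: the Gaussian target shape, the `sigma` value, the `stride` between heatmap cells and image pixels, and the helper names `gaussian_heatmap`, `decode_peak`, and `cell_to_pixel`.

```python
import math

def gaussian_heatmap(height, width, center_xy, sigma=2.0):
    """Build a height x width target heatmap peaked at center_xy = (x, y).

    Assumption: a Gaussian target; the paper's actual target may differ.
    """
    cx, cy = center_xy
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def decode_peak(heatmap):
    """Return the (x, y) cell holding the highest score."""
    best_xy, best_val = (0, 0), float("-inf")
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_val:
                best_xy, best_val = (x, y), value
    return best_xy

def cell_to_pixel(cell_xy, stride):
    """Map a heatmap cell to the pixel at that cell's center in the reference image."""
    x, y = cell_xy
    return ((x + 0.5) * stride, (y + 0.5) * stride)

# Hypothetical 8x8 heatmap over a 128x128 reference image (stride 16).
target = gaussian_heatmap(8, 8, center_xy=(5, 2))
peak = decode_peak(target)
pixel = cell_to_pixel(peak, stride=16)
```

Because the head predicts a score per location rather than offsets from predefined boxes, no anchor sizes need to be chosen in advance, which is the property the summary attributes to the anchor-free design.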

📝 Abstract

Cross-view object geo-localization aims to precisely pinpoint the same object in large-scale satellite imagery based on drone images. Because of significant differences in viewpoint and scale, coupled with complex background interference, traditional multi-stage "retrieval-matching" pipelines are prone to cumulative errors. To address this, we present SMGeo, a promptable, end-to-end, transformer-based model for object geo-localization. The model supports click prompting and outputs the object's location in real time when prompted, which allows interactive use. It employs a fully transformer-based architecture: a Swin-Transformer jointly encodes drone and satellite imagery, and an anchor-free transformer detection head performs coordinate regression. To better capture both inter-modal and intra-view dependencies, we introduce a grid-level sparse Mixture-of-Experts (GMoE) into the cross-view encoder, allowing it to adaptively activate specialized experts according to the content, scale, and source of each grid. The anchor-free detection head directly predicts object locations in the reference images via heatmap supervision, which avoids the scale bias and matching complexity introduced by predefined anchor boxes. On the drone-to-satellite task, SMGeo achieves leading accuracy at IoU=0.25 and IoU=0.5 and leading mIoU (87.51%, 62.50%, and 61.45% on the test set, respectively), significantly outperforming representative methods such as DetGeo (61.97%, 57.66%, and 54.05%, respectively). Ablation studies demonstrate complementary gains from shared encoding, query-guided fusion, and the grid-level sparse mixture-of-experts.
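The metrics quoted in the abstract can be made concrete with a small sketch. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` form, counts a prediction as correct when its IoU with the ground truth is at least the threshold, and takes mIoU as the mean per-sample IoU. The abstract does not spell out these conventions, so treat them as assumptions. The function names and the toy boxes are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(preds, gts, threshold):
    """Fraction of samples whose predicted box reaches the IoU threshold."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    return sum(iou >= threshold for iou in ious) / len(ious)

def mean_iou(preds, gts):
    """Mean IoU over all prediction / ground-truth pairs."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)

# Toy data: a perfect hit, a half-shifted box, and a complete miss.
preds = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 30, 30)]
gts = [(0, 0, 10, 10), (5, 0, 15, 10), (0, 0, 10, 10)]

acc_025 = accuracy_at_iou(preds, gts, 0.25)  # IoUs are 1, 1/3, 0 -> 2 of 3 pass
acc_050 = accuracy_at_iou(preds, gts, 0.50)  # only the perfect hit passes
miou = mean_iou(preds, gts)
```

On this toy data the half-shifted box (IoU of 1/3) passes the 0.25 threshold but not a stricter 0.5 one, which shows why accuracy at a loose threshold can sit well above mIoU.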

Problem

Research questions and friction points this paper is trying to address.

Pinpointing objects across drone and satellite imagery with significant viewpoint differences

Overcoming cumulative errors in traditional multi-stage geo-localization pipelines

Addressing complex background interference and scale variations in cross-view matching

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Swin-Transformer for joint drone-satellite feature encoding

Employs grid-level Mixture-of-Experts for adaptive expert activation

Implements anchor-free transformer detection head for coordinate regression
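As a rough illustration of grid-level sparse expert activation, the sketch below routes each grid token to its top-k experts and mixes their outputs with renormalized gate weights. It is a toy under stated assumptions, not the paper's GMoE: the experts are plain linear maps, the router sees only the token content (the abstract also mentions scale and source as routing signals), and the class name `GridMoE` and all sizes are hypothetical.

```python
import math
import random

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

class GridMoE:
    """Toy sparse MoE: each grid token is processed by its top-k experts only."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        self.router = rand_matrix(num_experts, dim)  # one logit per expert
        # Assumption: each expert is a single linear map, for brevity.
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def route(self, token):
        """Pick the top-k experts for one token and renormalize their gates."""
        logits = matvec(self.router, token)
        ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
        chosen = ranked[:self.top_k]
        gates = softmax([logits[e] for e in chosen])
        return chosen, gates

    def forward(self, tokens):
        """Return the gate-weighted expert outputs and the chosen experts per token."""
        outputs, assignments = [], []
        for token in tokens:
            chosen, gates = self.route(token)
            mixed = [0.0] * len(token)
            for expert_id, gate in zip(chosen, gates):
                out = matvec(self.experts[expert_id], token)
                mixed = [m + gate * o for m, o in zip(mixed, out)]
            outputs.append(mixed)
            assignments.append(chosen)
        return outputs, assignments

# Hypothetical 2x2 grid flattened to four 4-dimensional tokens.
rng = random.Random(1)
grid_tokens = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(4)]
moe = GridMoE(dim=4, num_experts=4, top_k=2)
outputs, assignments = moe.forward(grid_tokens)
```

Only `top_k` of the experts run for any given grid cell, so capacity can grow with the number of experts while per-token compute stays roughly fixed. That is the usual motivation for sparse routing.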
