🤖 AI Summary
Cross-view object geo-localization between UAV and satellite imagery is hampered by large viewpoint and scale discrepancies and by background clutter, which cause errors to accumulate in multi-stage pipelines. To address this, we propose SMGeo, an end-to-end promptable Transformer architecture that supports real-time, click-guided interactive localization. Methodologically: (1) a Swin-Transformer-based dual-view joint encoder fuses multi-scale features; (2) a grid-level sparse Mixture-of-Experts (MoE) module adaptively models local cross-modal representations; (3) a promptable query mechanism coupled with an anchor-free detection head avoids the scale bias of predefined anchor boxes; and (4) heatmap-based supervision enables precise coordinate regression. On the UAV-to-satellite test set, the method reports 87.51%, 62.50%, and 61.45% across its accuracy-at-IoU and mIoU metrics, compared with 61.97%, 57.66%, and 54.05% for the prior method DetGeo. Ablation studies confirm that the components are complementary.
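The two kinds of metric named above, accuracy at an IoU threshold and mean IoU, can be illustrated with a minimal sketch. It assumes axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form, one predicted box per query, and a `>=` comparison at the threshold; the function names are hypothetical and the paper's exact evaluation protocol may differ.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box],
                    threshold: float = 0.25) -> float:
    """Fraction of predictions whose IoU with the ground truth reaches the threshold."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


def mean_iou(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Average IoU over all (prediction, ground truth) pairs."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    return sum(box_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

Accuracy at a threshold counts a prediction as correct or not, while mean IoU rewards tighter boxes, so the two can rank methods differently.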
📝 Abstract
Cross-view object geo-localization aims to pinpoint, in large-scale satellite imagery, the same object observed in a drone image. Because of significant differences in viewpoint and scale, coupled with complex background interference, traditional multi-stage "retrieval-matching" pipelines are prone to cumulative errors. To address this, we present SMGeo, a promptable, end-to-end transformer-based model for object geo-localization. The model accepts click prompts and returns the object's location in real time, which allows interactive use. It employs a fully transformer-based architecture: a Swin-Transformer jointly encodes the drone and satellite imagery, and an anchor-free transformer detection head performs coordinate regression. To better capture both inter-modal and intra-view dependencies, we introduce a grid-level sparse Mixture-of-Experts (GMoE) into the cross-view encoder, allowing it to adaptively activate specialized experts according to the content, scale, and source of each grid. The anchor-free head predicts object locations directly in the reference image via heat-map supervision, which avoids the scale bias and matching complexity introduced by predefined anchor boxes. On the drone-to-satellite task, SMGeo achieves leading performance on the accuracy at IoU=0.25 and mIoU metrics (87.51%, 62.50%, and 61.45% on the test set), significantly outperforming representative methods such as DetGeo (61.97%, 57.66%, and 54.05%). Ablation studies demonstrate complementary gains from shared encoding, query-guided fusion, and the grid-level sparse mixture-of-experts.
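As an illustration of what grid-level sparse expert routing generally looks like, the following is a minimal sketch of top-k gating applied independently to each grid cell. It is not SMGeo's implementation: the linear gate, the value of `k`, the toy experts, and all function names are assumptions made for this example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def top_k_gate(logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k largest gate logits.

    Weights are a softmax restricted to the selected experts, so they sum to 1.
    """
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def sparse_moe(grids: Sequence[Vector], gate_weights: Sequence[Vector],
               experts: Sequence[Expert], k: int = 2) -> List[Vector]:
    """Route each grid cell to its top-k experts and mix their outputs.

    gate_weights holds one row per expert (an assumed linear gate); only the
    selected experts are evaluated, which is what makes the layer sparse.
    """
    outputs = []
    for x in grids:
        logits = [sum(w * v for w, v in zip(row, x)) for row in gate_weights]
        mixed = [0.0] * len(x)
        for idx, weight in top_k_gate(logits, k):
            y = experts[idx](x)
            mixed = [m + weight * yi for m, yi in zip(mixed, y)]
        outputs.append(mixed)
    return outputs


# Toy usage: three hypothetical experts, two grid cells, one expert per cell.
toy_experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [-v for v in x],
    lambda x: [v + 1.0 for v in x],
]
toy_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
toy_out = sparse_moe([[3.0, 0.0], [0.0, 3.0]], toy_gate, toy_experts, k=1)
```

Because the gate sees each grid cell's own features, cells from different sources or scales can select different experts, which matches the adaptive activation described in the abstract at a conceptual level.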