AI Summary
This work addresses the problem of aligning ground-level images with geo-registered satellite imagery when the viewpoint gap is large or GPS is unreliable. The authors propose Wrivinder, a zero-shot, geometry-driven cross-view geolocalization framework that aggregates multiple ground photographs into a consistent 3D scene and renders a stable top-down view for matching against satellite imagery. The approach combines Structure-from-Motion (SfM) reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth-based metric cues, and it works without paired supervision. Because the task lacks suitable benchmarks, the authors also release MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments. In zero-shot experiments, Wrivinder achieves sub-30 m geolocation accuracy across both dense and large-area scenes.
Abstract
Aligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth-based metric cues to produce a stable zenith-view rendering that can be directly matched to satellite context for metrically accurate camera geo-localization. To support systematic evaluation of this task, which lacks suitable benchmarks, we also release MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments. Together, Wrivinder and MC-Sat provide a first comprehensive baseline and testbed for studying geometry-centered cross-view alignment without paired supervision. In zero-shot experiments, Wrivinder achieves sub-30 m geolocation accuracy across both dense and large-area scenes, highlighting the promise of geometry-based aggregation for robust ground-to-satellite localization.
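The abstract does not say how the matched top-down rendering is turned into camera coordinates, so the following is only a minimal sketch of one plausible final step, not Wrivinder's actual procedure. It assumes that matching has already produced point correspondences between the rendered view and a satellite tile, fits a least-squares 2D similarity transform (scale, rotation, translation) to them, maps a camera position from render pixels into satellite pixels, and converts the pixel error to metres. All function names, the sample correspondences, and the 0.5 m/pixel ground resolution are hypothetical.

```python
import math


def fit_similarity_2d(src, dst):
    """Least-squares 2D similarity mapping src points onto dst points.

    Points are (x, y) tuples. Treating them as complex numbers, the model is
    q = a * p + b, where complex a encodes scale and rotation and complex b
    encodes translation. Reflections are not modelled.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matched point pairs")
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    denom = sum(abs(pi - p_mean) ** 2 for pi in p)
    if denom == 0:
        raise ValueError("source points are degenerate")
    a = sum((qi - q_mean) * (pi - p_mean).conjugate()
            for pi, qi in zip(p, q)) / denom
    b = q_mean - a * p_mean
    return a, b


def apply_similarity(a, b, point):
    """Map one (x, y) point through the fitted transform."""
    z = a * complex(point[0], point[1]) + b
    return (z.real, z.imag)


def pixel_error_to_metres(pred_px, true_px, metres_per_pixel):
    """Euclidean pixel distance scaled by the tile's ground resolution."""
    dx = pred_px[0] - true_px[0]
    dy = pred_px[1] - true_px[1]
    return math.hypot(dx, dy) * metres_per_pixel


if __name__ == "__main__":
    # Hypothetical correspondences: render pixels -> satellite pixels.
    render_pts = [(0, 0), (100, 0), (0, 100), (100, 100)]
    sat_pts = [(50, 20), (50, 220), (-150, 20), (-150, 220)]
    a, b = fit_similarity_2d(render_pts, sat_pts)

    # Hypothetical camera position in the render and its true satellite pixel.
    cam_sat = apply_similarity(a, b, (50, 50))
    err_m = pixel_error_to_metres(cam_sat, (-50, 130), metres_per_pixel=0.5)
    print("scale:", abs(a), "rotation (deg):", math.degrees(math.atan2(a.imag, a.real)))
    print("camera in satellite pixels:", cam_sat, "error (m):", err_m)
```

A similarity transform is a natural choice here only if the rendering is a true overhead view with the same handedness as the satellite tile; real matches would also contain outliers, so a robust estimator such as RANSAC around this fit would normally be needed.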