🤖 AI Summary
This work addresses a limitation of worldwide image geolocation: retrieval-based methods degrade on scenes that are absent from their fixed reference databases, a problem compounded by global visual diversity. The authors propose GeoSearch, an open-world framework that integrates web-scale reverse image search into a retrieval-augmented generation (RAG) pipeline. It augments the prompt of a large multimodal model (LMM) with database-retrieved coordinates and textual evidence extracted from web pages. To reduce noise from irrelevant web content, the method applies a two-layer filter: image matching followed by confidence-based gating. Under leakage-aware evaluation on the Im2GPS3k and YFCC4k benchmarks, GeoSearch is reported to outperform existing approaches. The code and data are publicly released.
📝 Abstract
Worldwide image geolocalization, which aims to predict the GPS coordinates of any image on Earth, remains challenging due to global visual diversity. Recent generative approaches based on Retrieval-Augmented Generation (RAG) and Large Multimodal Models (LMMs) leverage candidates retrieved from fixed databases for reasoning, but often struggle with scenes that are absent from the reference set. In this work, we propose GeoSearch, an open-world geolocalization framework that integrates web-scale reverse image search into the RAG pipeline. GeoSearch augments LMM prompts with database-retrieved coordinates and textual evidence extracted from web pages. To mitigate noise from irrelevant content, we introduce a two-layer filtering mechanism consisting of image matching followed by confidence-based gating. Experiments on the standard benchmarks Im2GPS3k and YFCC4k demonstrate the superiority of GeoSearch under leakage-aware evaluation. Our code and data are publicly available to support reproducibility.
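The two-layer filtering and prompt augmentation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the image-matching score or the confidence value is computed, nor which thresholds are used. The names `WebEvidence`, `filter_evidence`, `build_prompt`, both threshold values, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WebEvidence:
    """One reverse-image-search hit. All fields are assumptions for illustration."""
    text: str           # text snippet extracted from the web page
    match_score: float  # similarity between the query image and the page image, in [0, 1]
    confidence: float   # how much the snippet is trusted as location evidence, in [0, 1]


def filter_evidence(
    hits: List[WebEvidence],
    match_threshold: float = 0.5,       # hypothetical value
    confidence_threshold: float = 0.7,  # hypothetical value
) -> List[WebEvidence]:
    """Apply the two filtering layers in order."""
    # Layer 1: image matching. Drop hits whose image does not match the query.
    matched = [h for h in hits if h.match_score >= match_threshold]
    # Layer 2: confidence gating. Keep only matched hits that pass the gate.
    return [h for h in matched if h.confidence >= confidence_threshold]


def build_prompt(
    candidates: List[Tuple[float, float]],
    evidence: List[WebEvidence],
) -> str:
    """Assemble a text prompt from database coordinates and filtered web text."""
    lines = ["Estimate the GPS coordinates (latitude, longitude) of the image."]
    if candidates:
        lines.append("Coordinates of similar database images:")
        lines.extend(f"- ({lat:.4f}, {lon:.4f})" for lat, lon in candidates)
    if evidence:
        lines.append("Text found on web pages showing matching images:")
        lines.extend(f"- {e.text}" for e in evidence)
    return "\n".join(lines)
```

A call such as `build_prompt(db_candidates, filter_evidence(web_hits))` would yield the augmented prompt text. When no web evidence survives filtering, the prompt falls back to database candidates alone, which corresponds to the fixed-database RAG setting the abstract contrasts against.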