When Vision Misleads, Let Location Speak: A Worldwide Image Geo-Localization Method via Location Attention Mechanism and Large Multimodal Models

📅 2026-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a common failure in worldwide image geo-localization: images are mislocalized because visually similar scenes occur at distinct geographic locations. The authors propose TransGeoCLIP, a framework that combines a location attention mechanism with large multimodal models (LMMs) to build a unified image-text-GPS embedding space. GPS coordinates are encoded explicitly with a Transformer encoder equipped with location attention, which helps separate scenes that look alike but are geographically far apart. Retrieval-augmented inference with an LMM then refines the final location prediction. On the IM2GPS, IM2GPS3k, YFCC4k, and YFCC26k benchmarks, TransGeoCLIP improves street-level accuracy (error within 1 km) over the previous best methods by 1.5%, 1.07%, 7.18%, and 9.75%, respectively.
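To make the idea of encoding GPS coordinates as input features more concrete, here is a minimal sketch that maps a latitude/longitude pair to a vector of sinusoidal features, in the spirit of Transformer positional encodings. It is an illustration only and not the paper's encoder: the function name `encode_gps`, the `num_freqs` parameter, and the choice of power-of-two frequencies are all assumptions made for this example.

```python
import math


def encode_gps(lat_deg, lon_deg, num_freqs=4):
    """Map (latitude, longitude) in degrees to sinusoidal features.

    Hypothetical helper for illustration; the paper's actual GPS
    encoder (a Transformer with location attention) is not shown here.
    """
    # Normalize both coordinates to the range [-1, 1].
    lat = lat_deg / 90.0
    lon = lon_deg / 180.0

    features = []
    for value in (lat, lon):
        for k in range(num_freqs):
            # Assumed frequency schedule: pi, 2*pi, 4*pi, ...
            freq = (2.0 ** k) * math.pi
            features.append(math.sin(freq * value))
            features.append(math.cos(freq * value))
    # Output length is 2 coordinates * num_freqs * (sin, cos).
    return features


vector = encode_gps(48.8584, 2.2945)
print(len(vector))
```

Low frequencies vary slowly across the globe while high frequencies separate nearby points, which is one common way to give a downstream encoder both coarse and fine location cues.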

📝 Abstract

Worldwide image geo-localization aims to determine the capture location of an image on a global scale. Existing methods often mislocalize images by matching them to visually similar scenes from different geographic regions, which limits reliability in practical applications. To address this issue, we propose TransGeoCLIP, a novel retrieval-based framework that integrates a location attention mechanism and large multimodal models (LMMs). By using a Transformer encoder with location attention to encode GPS coordinates, TransGeoCLIP can effectively distinguish geographic features among visually similar images. The framework consists of two stages: 1) retrieval database construction, which employs Transformers equipped with location attention mechanisms to encode labeled GPS coordinates and enhance location semantics, and then builds a joint image-text-GPS embedding through CLIP; 2) retrieval-augmented inference, which leverages LMMs to infer the final image location prediction from the retrieved database results. Extensive experimental results on diverse datasets, including IM2GPS, IM2GPS3k, YFCC4k, and YFCC26k, demonstrate that TransGeoCLIP significantly enhances localization performance for visually similar images. In particular, street-level localization accuracy (within 1 km error) is substantially improved, surpassing state-of-the-art methods by 1.5%, 1.07%, 7.18%, and 9.75% on these benchmarks, respectively.
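The street-level metric quoted above counts a prediction as correct when it falls within 1 km of the true capture location. A minimal sketch of such a metric follows, assuming great-circle (haversine) distance on a spherical Earth of radius 6371 km; the function names `haversine_km` and `accuracy_within` are hypothetical, and the paper's exact evaluation code may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def accuracy_within(predictions, ground_truth, threshold_km=1.0):
    """Fraction of predictions whose error is at most threshold_km."""
    hits = sum(
        1
        for (plat, plon), (glat, glon) in zip(predictions, ground_truth)
        if haversine_km(plat, plon, glat, glon) <= threshold_km
    )
    return hits / len(ground_truth)


# Illustrative coordinates only, not benchmark data.
preds = [(0.0, 0.0), (10.0, 10.0)]
truth = [(0.0, 0.005), (20.0, 20.0)]
print(accuracy_within(preds, truth, threshold_km=1.0))
```

The same function can report coarser levels by raising `threshold_km`, for example to 25 km for city-level accuracy as commonly used with these benchmarks.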

Problem

Research questions and friction points this paper is trying to address.

image geo-localization

visual similarity

geographic mislocalization

global scale

location reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

location attention mechanism

large multimodal models

image geo-localization