GeoVLM: Improving Automated Vehicle Geolocalisation Using Vision-Language Matching

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Cross-view geolocalization often suffers from low top-1 accuracy in visually similar scenes, where the correct satellite image fails to rank first. To address this, we propose the first trainable re-ranking framework grounded in interpretable natural language descriptions. Leveraging the zero-shot cross-modal alignment capability of vision-language models (VLMs), our method maps ground-level and satellite images into a shared semantic space and employs natural language guidance to refine feature representations, thereby enhancing matching discriminability. Unlike conventional pixel- or feature-level matching approaches, our method operates at the semantic level, significantly improving fine-grained scene discrimination. Extensive experiments demonstrate state-of-the-art performance on VIGOR, University-1652, and a newly introduced real-world driving dataset—Cross-View UK—with substantial gains in top-1 accuracy. The code is publicly available.
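The re-ranking idea described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: it assumes an initial retriever has already returned top-k satellite candidates, and that a vision-language model has embedded the query ground image, each candidate image, and a natural-language scene description per candidate into one shared space. The blending weight `alpha` and the scoring rule are illustrative assumptions.

```python
# Hedged sketch of VLM-based re-ranking (assumed interfaces, not GeoVLM's code):
# blend image-image similarity with image-text (description) similarity,
# then re-order the candidates by the combined score.
import numpy as np

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(query_emb, cand_img_embs, cand_txt_embs, alpha=0.5):
    """Return candidate indices re-ordered best-first.

    alpha weights the language-description channel against the raw
    image-image similarity; both channels live in the VLM's shared space.
    """
    scores = [
        (1 - alpha) * cosine(query_emb, img) + alpha * cosine(query_emb, txt)
        for img, txt in zip(cand_img_embs, cand_txt_embs)
    ]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Toy example: candidate 0 is semantically closest to the query in both
# the image and the description channel, so it should rank first.
q = np.array([1.0, 0.0, 0.0, 0.0])
imgs = [np.array([0.9, 0.1, 0.0, 0.0]),
        np.array([0.1, 0.9, 0.0, 0.0]),
        np.array([-1.0, 0.0, 0.0, 0.0])]
txts = [np.array([0.95, 0.05, 0.0, 0.0]),
        np.array([0.0, 1.0, 0.0, 0.0]),
        np.array([0.0, 0.0, 1.0, 0.0])]
print(rerank(q, imgs, txts))  # → [0, 1, 2]
```

The point of the language channel is that two visually near-identical satellite tiles can still differ in their descriptions (road layout, landmarks), which is where the semantic-level discrimination comes from.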

📝 Abstract
Cross-view geo-localisation identifies the coarse geographical position of an automated vehicle by matching a ground-level image to a geo-tagged satellite image from a database. Despite advances in cross-view geo-localisation, significant challenges persist, such as visually similar scenes that make it difficult to rank the correct image first. Existing approaches achieve high recall rates but still fail to place the correct image as the top match. To address this challenge, this paper proposes GeoVLM, a novel approach that uses the zero-shot capabilities of vision-language models to enable cross-view geo-localisation through interpretable cross-view language descriptions. GeoVLM is a trainable re-ranking approach that improves the best-match accuracy of cross-view geo-localisation. GeoVLM is evaluated on the standard benchmarks VIGOR and University-1652, as well as in real-life driving environments using Cross-View United Kingdom, a new benchmark dataset introduced in this paper. The results show that GeoVLM improves the retrieval performance of cross-view geo-localisation compared with state-of-the-art methods, aided by explainable natural language descriptions. The code is available at https://github.com/CAV-Research-Lab/GeoVLM
Problem

Research questions and friction points this paper is trying to address.

Improving vehicle geolocalisation accuracy via cross-view image matching
Addressing similar-scene challenges in cross-view geo-localisation
Enhancing retrieval performance using interpretable language descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses vision-language models for geo-localisation
Trainable re-ranking improves top-match accuracy
Leverages interpretable language descriptions