🤖 AI Summary
This work addresses global robot localization using semantic footprint maps that contain only visual landmark names and coarse region annotations, with no geometric or appearance priors. The method leverages vision-language models (VLMs) to perform cross-modal landmark retrieval between multi-directional observation images and the semantic map; retrieved matches are integrated into a Monte Carlo Localization (MCL) framework, where probabilistic fusion of visual-semantic observations and LiDAR scans refines the pose estimate. To the authors' knowledge, this is the first approach to enable robust localization directly from human-readable, minimal footprint maps, removing the reliance on environmental stability or high-fidelity metric mapping. Evaluated in both simulated and real-world retail environments, the method achieves a 27% improvement in localization accuracy and a 63% reduction in failure rate over conventional scan-matching baselines, and it remains robust under challenging conditions including illumination variation, dynamic occlusions, and subtle layout changes.
📝 Abstract
This paper presents Vision-Language Global Localization (VLG-Loc), a novel global localization method that uses human-readable labeled footprint maps containing only the names and areas of distinctive visual landmarks in an environment. While humans naturally localize themselves using such maps, translating this capability to robotic systems remains highly challenging due to the difficulty of establishing correspondences between observed landmarks and those in the map without geometric and appearance details. To address this challenge, VLG-Loc leverages a vision-language model (VLM) to search the robot's multi-directional image observations for the landmarks noted in the map. The method then estimates the robot's pose within a Monte Carlo localization framework, where the retrieved landmarks are used to evaluate the likelihood of each pose hypothesis. Experimental validation in simulated and real-world retail environments demonstrates superior robustness compared to existing scan-based methods, particularly under environmental changes. Further improvements are achieved through the probabilistic fusion of visual and scan-based localization.
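The observation step described above can be sketched in a few lines. Everything here is an illustrative assumption rather than a detail of the paper: the map contents, the reduction of each named area to a point center, and the Gaussian bearing-error model standing in for the VLM's landmark matches (the paper's system retrieves landmarks with a VLM and additionally fuses LiDAR scan likelihoods).

```python
import math
import random

# Hypothetical semantic footprint map: landmark name -> rough area center (x, y).
# The paper's maps annotate named areas, not points; centers keep the sketch small.
SEMANTIC_MAP = {
    "bakery": (2.0, 8.0),
    "checkout": (9.0, 1.0),
    "produce": (1.0, 1.5),
}

def landmark_likelihood(particle, detection, sigma=math.radians(25.0)):
    """Likelihood that a retrieved landmark `detection` = (name, bearing in the
    robot frame) is consistent with `particle` = (x, y, theta), modeled as a
    Gaussian on the bearing error (an illustrative choice)."""
    x, y, theta = particle
    name, bearing = detection
    lx, ly = SEMANTIC_MAP[name]
    expected = math.atan2(ly - y, lx - x) - theta          # predicted bearing
    err = math.atan2(math.sin(bearing - expected),
                     math.cos(bearing - expected))          # wrap to [-pi, pi]
    return math.exp(-0.5 * (err / sigma) ** 2)

def mcl_observation_update(particles, detections):
    """One MCL observation step: reweight each particle by the product of
    per-landmark likelihoods, then resample in proportion to the weights."""
    weights = []
    for p in particles:
        w = 1.0
        for d in detections:
            w *= landmark_likelihood(p, d)
        weights.append(w + 1e-12)                           # guard against all-zero weights
    return random.choices(particles, weights=weights, k=len(particles))
```

A single update with detections simulated from a ground-truth pose concentrates an initially uniform particle set around that pose; in the full system, scan-based likelihoods would be multiplied into the same per-particle weight to realize the probabilistic fusion described above.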