🤖 AI Summary
This work addresses the lack of fine-grained evaluation of multimodal large language models (MLLMs) on embodied geo-localization. The authors introduce ERGeoBench, a unified diagnostic benchmark that measures four capabilities: foundational perception, spatial awareness, commonsense reasoning, and geo-localization reasoning. It is built on an interactive environment derived from 2,207 globally distributed street-view panoramas. The environment supports yaw, pitch, and zoom operations, and models are assessed under three progressively richer observational settings: single-view, panorama-view, and embodied-view. Experiments on leading proprietary and open-source MLLMs show that current models can infer high-level geographic semantics but still struggle with fine-grained visual perception, metric localization, and spatial consistency across views. Geo-localization performance is also strongly correlated with the other three capability dimensions, which suggests that accurate localization depends on integrated perception and reasoning rather than isolated visual recognition.
📝 Abstract
Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressive settings -- single-view, panorama-view, and embodied-view -- where agents may actively acquire observations through sequential changes in yaw, pitch, and zoom. The benchmark contains 2,207 globally distributed street-view panoramas and measures four complementary capabilities: foundational perception, spatial awareness, commonsense reasoning, and geo-localization reasoning. Evaluations of leading proprietary and open-source MLLMs show that current models can infer high-level geographic semantics, but still struggle with fine-grained perceptual operations, metric localization, and spatial consistency across views. We further observe that geo-localization is strongly correlated with the other capability dimensions, suggesting that accurate localization depends on integrated perception, spatial reasoning, and commonsense inference rather than isolated visual recognition. Overall, ERGeoBench provides a unified framework for diagnosing and advancing human-like embodied geo-localization. Project Page: https://kaixuewen.github.io/ERGeoBench/
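To make the embodied-view setting more concrete, the following is a minimal sketch of how an agent's viewing state might be updated through sequential yaw, pitch, and zoom changes inside a single panorama. It is not taken from the ERGeoBench code: the `ViewState` class, the `step` and `rollout` names, the additive action format, and the value ranges (yaw wrapped to [0, 360), pitch clamped to [-90, 90], zoom clamped to [1, 4]) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewState:
    """Hypothetical camera state inside one panorama (names and ranges are assumptions)."""
    yaw: float = 0.0    # heading in degrees, wrapped to [0, 360)
    pitch: float = 0.0  # elevation in degrees, clamped to [-90, 90]
    zoom: float = 1.0   # magnification factor, clamped to [1, 4]

    def step(self, d_yaw: float = 0.0, d_pitch: float = 0.0, d_zoom: float = 0.0) -> "ViewState":
        """Return a new state after applying one relative action."""
        yaw = (self.yaw + d_yaw) % 360.0                       # wrap around the horizon
        pitch = max(-90.0, min(90.0, self.pitch + d_pitch))    # cannot look past straight up/down
        zoom = max(1.0, min(4.0, self.zoom + d_zoom))          # assumed zoom limits
        return ViewState(yaw, pitch, zoom)


def rollout(start: ViewState, actions: list[tuple[float, float, float]]) -> list[ViewState]:
    """Apply (d_yaw, d_pitch, d_zoom) actions in order and return every visited state."""
    states = [start]
    for action in actions:
        states.append(states[-1].step(*action))
    return states


if __name__ == "__main__":
    # Turn right 90 degrees, tilt up 10 degrees, then zoom in by one step.
    for state in rollout(ViewState(), [(90, 0, 0), (0, 10, 0), (0, 0, 1)]):
        print(state)
```

Under these assumptions, the single-view setting corresponds to one fixed `ViewState`, while the embodied-view setting corresponds to a sequence of states chosen by the model itself. Keeping the state immutable makes it easy to log the full trajectory of observations an agent requested.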