ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

📅 2026-05-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the lack of fine-grained, systematic evaluation of multimodal large language models (MLLMs) on embodied geo-localization. The authors propose ERGeoBench, a unified diagnostic benchmark covering four capabilities: foundational perception, spatial awareness, commonsense reasoning, and geo-localization reasoning. It is built on an interactive environment derived from 2,207 globally distributed street-view panoramas, in which an agent can change yaw, pitch, and zoom. Models are assessed in three progressively complex observational settings: single-view, panorama-view, and embodied-view. Experiments show that leading proprietary and open-source models can infer high-level geographic semantics but struggle with fine-grained perceptual operations, metric localization, and spatial consistency across views. Geo-localization performance is also strongly correlated with the other three capability dimensions, suggesting that accurate localization depends on integrated perception, spatial reasoning, and commonsense inference.
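To make the yaw, pitch, and zoom operations concrete, here is a minimal sketch of how an embodied-view episode could track its camera state. It is not the benchmark's actual interface: the `ViewState` class, the `rollout` helper, the degree units, and the zoom range of 1x to 4x are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewState:
    """Hypothetical camera state inside a 360-degree panorama (not ERGeoBench's API)."""
    yaw: float = 0.0    # degrees, wraps around the horizon
    pitch: float = 0.0  # degrees, clamped to [-90, 90]
    zoom: float = 1.0   # magnification, clamped to an assumed range of [1, 4]

    def apply(self, d_yaw=0.0, d_pitch=0.0, zoom_factor=1.0):
        """Return the new state after one relative yaw/pitch/zoom action."""
        yaw = (self.yaw + d_yaw) % 360.0
        pitch = max(-90.0, min(90.0, self.pitch + d_pitch))
        zoom = max(1.0, min(4.0, self.zoom * zoom_factor))
        return ViewState(yaw, pitch, zoom)


def rollout(actions):
    """Apply a sequence of (d_yaw, d_pitch, zoom_factor) actions from the default view."""
    state = ViewState()
    trajectory = [state]
    for d_yaw, d_pitch, zoom_factor in actions:
        state = state.apply(d_yaw, d_pitch, zoom_factor)
        trajectory.append(state)
    return trajectory
```

Under these assumptions, a single-view setting corresponds to an empty action list, while an embodied-view episode is a sequence of such actions chosen by the model.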
📝 Abstract
Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressive settings (single-view, panorama-view, and embodied-view), where agents may actively acquire observations through sequential changes in yaw, pitch, and zoom. The benchmark contains 2,207 globally distributed street-view panoramas and measures four complementary capabilities: foundational perception, spatial awareness, commonsense reasoning, and geo-localization reasoning. Evaluations of leading proprietary and open-source MLLMs show that current models can infer high-level geographic semantics, but still struggle with fine-grained perceptual operations, metric localization, and spatial consistency across views. We further observe that geo-localization is strongly correlated with the other capability dimensions, suggesting that accurate localization depends on integrated perception, spatial reasoning, and commonsense inference rather than isolated visual recognition. Overall, ERGeoBench provides a unified framework for diagnosing and advancing human-like embodied geo-localization. Project Page: https://kaixuewen.github.io/ERGeoBench/
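The abstract does not say how metric localization is scored. A common choice in geo-localization work is the great-circle distance between predicted and true coordinates, summarized as the share of predictions within a distance threshold. The sketch below assumes that approach; the function names and any thresholds passed to them are illustrative, not taken from ERGeoBench.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # min() guards against floating-point values slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def accuracy_at(errors_km, threshold_km):
    """Fraction of predictions whose error is within threshold_km (assumed metric)."""
    if not errors_km:
        return 0.0
    return sum(e <= threshold_km for e in errors_km) / len(errors_km)
```

Reporting accuracy at several thresholds (for example street, city, and country scale) would separate coarse semantic guesses from precise metric localization, which is the gap the abstract highlights.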
Problem

Research questions and friction points this paper is trying to address.

embodied geo-localization
multimodal large language models
fine-grained evaluation
spatial reasoning
geographic semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

embodied geo-localization
multimodal large language models
active perception
spatial reasoning
diagnostic benchmark