Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the insufficient evaluation of multimodal large language models (MLLMs) on fine-grained visual understanding and spatial reasoning, particularly for navigation over complex, high-resolution urban metro maps. To this end, the authors introduce ReasonMap, the first fine-grained visual reasoning benchmark built specifically on high-resolution city subway maps, comprising 1,008 question-answer pairs across 30 metropolitan areas. They propose a two-tiered automated evaluation framework that assesses both answer correctness and response quality, and systematically evaluate 15 state-of-the-art MLLMs, covering both base and reasoning-enhanced variants. Key findings: (1) performance degrades significantly when visual inputs are occluded, indicating that although models can answer some questions from prior knowledge, fine-grained visual reasoning still requires genuine visual perception; and (2) open-source base models often outperform their reasoning-enhanced counterparts, whereas closed-source models show the inverse trend, pointing to structural differences in how reasoning capabilities integrate with visual modality processing.
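The two-tiered evaluation is described only at a high level above. Below is a minimal Python sketch of what such a pipeline could look like, assuming hypothetical `RouteAnswer` fields and a partial-credit rubric; it is an illustration in the spirit of the framework, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class RouteAnswer:
    """Assumed answer structure for a transit-route question."""
    lines_taken: list[str]        # e.g. ["Line 2", "Line 10"]
    transfer_stations: list[str]  # stations where the route changes lines

def tier1_correctness(pred: RouteAnswer, gold: RouteAnswer) -> bool:
    # Tier 1: binary correctness, exact match against the ground-truth route.
    return (pred.lines_taken == gold.lines_taken
            and pred.transfer_stations == gold.transfer_stations)

def tier2_quality(pred: RouteAnswer, gold: RouteAnswer) -> float:
    # Tier 2: partial credit in [0, 1] for near-miss routes (assumed rubric).
    line_hits = len(set(pred.lines_taken) & set(gold.lines_taken))
    station_hits = len(set(pred.transfer_stations) & set(gold.transfer_stations))
    denom = max(len(gold.lines_taken) + len(gold.transfer_stations), 1)
    return (line_hits + station_hits) / denom

def evaluate(pred: RouteAnswer, gold: RouteAnswer) -> dict:
    correct = tier1_correctness(pred, gold)
    return {"correct": correct,
            "quality": 1.0 if correct else tier2_quality(pred, gold)}
```

Separating a strict correctness check from a graded quality score lets the benchmark distinguish a model that picks an entirely wrong route from one that merely misses a single transfer.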

📝 Abstract
Multimodal large language models (MLLMs) have recently achieved significant progress in visual tasks, including semantic scene understanding and text-image alignment, with reasoning variants enhancing performance on complex tasks involving mathematics and logic. However, their capacity for reasoning tasks involving fine-grained visual understanding remains insufficiently evaluated. To address this gap, we introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Furthermore, we design a two-level evaluation pipeline that properly assesses answer correctness and quality. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern: among open-source models, base models outperform reasoning ones, while the opposite trend is observed in closed-source models. Additionally, performance generally degrades when visual inputs are masked, indicating that while MLLMs can leverage prior knowledge to answer some questions, fine-grained visual reasoning tasks still require genuine visual perception for strong performance. Our benchmark study offers new insights into visual reasoning and contributes to investigating the gap between open-source and closed-source models.
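The masked-input finding suggests a simple ablation: ask the same questions with and without the map image and compare accuracy. A minimal sketch, assuming hypothetical helpers `ask` (queries the MLLM) and `is_correct` (the tier-1 correctness check); this illustrates the ablation the abstract describes, not the authors' actual code.

```python
from statistics import mean

def masked_input_ablation(ask, is_correct, benchmark):
    """benchmark: iterable of (question, map_image, gold_answer) triples."""
    items = list(benchmark)
    acc_visual = mean(is_correct(ask(q, image=img), gold) for q, img, gold in items)
    acc_masked = mean(is_correct(ask(q, image=None), gold) for q, _, gold in items)
    # A large drop suggests genuine visual perception is being used;
    # a small drop suggests the model answers from prior knowledge
    # of the transit network (e.g. memorized metro lines).
    return {"visual": acc_visual, "masked": acc_masked,
            "drop": acc_visual - acc_masked}
```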
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' fine-grained visual reasoning on transit maps
Assessing spatial reasoning with high-resolution maps and QA pairs
Comparing performance of open-source vs closed-source MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces ReasonMap, a benchmark for fine-grained visual reasoning (a data-schema sketch follows this list)
Uses high-resolution transit maps from 30 cities across 13 countries
Designs a two-level evaluation pipeline that assesses answer correctness and quality
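A hypothetical schema for a single ReasonMap item: the paper specifies two question types and three templates but not field names, so everything below is an illustrative assumption.

```python
from dataclasses import dataclass
from enum import Enum

class QuestionType(Enum):
    # The paper reports two question types; these labels are assumed.
    SHORT = "short"
    LONG = "long"

@dataclass
class ReasonMapItem:
    city: str                    # one of 30 cities across 13 countries
    map_path: str                # path to the high-resolution transit map
    question_type: QuestionType  # one of the two question types
    template_id: int             # one of the three question templates
    question: str
    gold_answer: str             # reference route used by the evaluator
```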