DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) struggle with long-term Earth observation analysis, as existing works predominantly focus on single- or dual-temporal remote sensing imagery, limiting their capacity to model urban dynamic evolution. Method: We introduce DVL-Suite—the first multi-temporal remote sensing benchmark for long-term urban dynamics understanding—covering 42 U.S. megacities from 2005–2023 at 1.0 m resolution, supporting pixel-level change detection, region-level quantitative analysis, and scene-level urban narrative generation. We propose a multi-granularity evaluation framework, the DVL-Instruct instruction-tuning dataset, and DVLChat—a baseline MLLM jointly capable of question answering and segmentation. Contribution/Results: Comprehensive evaluation of 17 state-of-the-art MLLMs reveals critical deficiencies in long-sequence temporal modeling and quantitative reasoning. DVLChat achieves superior performance across all seven tasks of DVL-Bench, enabling, for the first time, interactive, cross-scale interpretation of urban dynamics.

📝 Abstract
Multimodal large language models have demonstrated remarkable capabilities in visual understanding, but their application to long-term Earth observation analysis remains limited, primarily focusing on single-temporal or bi-temporal imagery. To address this gap, we introduce DVL-Suite, a comprehensive framework for analyzing long-term urban dynamics through remote sensing imagery. Our suite comprises 15,063 high-resolution (1.0m) multi-temporal images spanning 42 megacities in the U.S. from 2005 to 2023, organized into two components: DVL-Bench and DVL-Instruct. The DVL-Bench includes seven urban understanding tasks, from fundamental change detection (pixel-level) to quantitative analyses (regional-level) and comprehensive urban narratives (scene-level), capturing diverse urban dynamics including expansion/transformation patterns, disaster assessment, and environmental challenges. We evaluate 17 state-of-the-art multimodal large language models and reveal their limitations in long-term temporal understanding and quantitative analysis. These challenges motivate the creation of DVL-Instruct, a specialized instruction-tuning dataset designed to enhance models' capabilities in multi-temporal Earth observation. Building upon this dataset, we develop DVLChat, a baseline model capable of both image-level question-answering and pixel-level segmentation, facilitating a comprehensive understanding of city dynamics through language interactions.
Problem

Research questions and friction points this paper is trying to address.

Existing MLLMs are limited in long-term urban dynamics analysis, as prior work focuses on single- or bi-temporal remote sensing imagery
No benchmark exists for evaluating multi-temporal (2005–2023) urban evolution understanding, motivating DVL-Suite
Models need stronger multi-temporal Earth observation capabilities, addressed via instruction tuning on DVL-Instruct
Innovation

Methods, ideas, or system contributions that make the work stand out.

DVL-Suite, a comprehensive framework for long-term urban dynamics analysis across 42 U.S. megacities
DVL-Instruct, a specialized instruction-tuning dataset for multi-temporal Earth observation
DVLChat, a baseline MLLM supporting both image-level question answering and pixel-level segmentation