🤖 AI Summary
Existing remote sensing image captioning methods predominantly operate at the coarse-grained image level and fail to capture object-level semantics and structural details. To address this, we propose Geo-DLC, the first object-level fine-grained captioning task for remote sensing, and introduce DE-Dataset, a large-scale benchmark with precise object-level attribute and contextual annotations, alongside the domain-specific evaluation framework DE-Benchmark. We further design DescribeEarth, a multimodal large language model tailored for remote sensing, incorporating a scale-adaptive focal mechanism and a domain-guided fusion module to jointly model high-resolution visual details and geospatial semantic priors. Experiments demonstrate that DescribeEarth consistently outperforms general-purpose multimodal LLMs on DE-Benchmark in factual accuracy, descriptive richness, and grammatical correctness. Notably, it performs robustly across simple scenes, complex scenes, and out-of-distribution remote sensing imagery.
📝 Abstract
Automated textual description of remote sensing images is crucial for unlocking their full potential in diverse applications, from environmental monitoring to urban planning and disaster management. However, existing studies in remote sensing image captioning primarily focus on the image level and lack object-level fine-grained interpretation, which prevents full use of the rich semantic and structural information contained in remote sensing images. To address this limitation, we propose Geo-DLC, a novel task of object-level fine-grained image captioning for remote sensing. To support this task, we construct DE-Dataset, a large-scale dataset containing 25 categories and 261,806 annotated instances with detailed descriptions of object attributes, relationships, and contexts. Furthermore, we introduce DE-Benchmark, an LLM-assisted, question-answering-based evaluation suite designed to systematically measure model capabilities on the Geo-DLC task. We also present DescribeEarth, a Multi-modal Large Language Model (MLLM) architecture explicitly designed for Geo-DLC. It integrates a scale-adaptive focal strategy and a domain-guided fusion module that leverages remote sensing vision-language model features, encoding high-resolution details and remote sensing category priors while maintaining global context. Our DescribeEarth model consistently outperforms state-of-the-art general MLLMs on DE-Benchmark, demonstrating superior factual accuracy, descriptive richness, and grammatical soundness, particularly in capturing intrinsic object features and surrounding environmental attributes across simple, complex, and even out-of-distribution remote sensing scenarios. All data, code, and weights are released at https://github.com/earth-insights/DescribeEarth.