SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model

📅 2025-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Modeling implicit spatial semantic queries—such as localizing target regions described in natural language—from remote sensing imagery remains challenging due to difficulties in capturing spatial context, domain-specific knowledge, and user intent. Method: This paper introduces “geospatial pixel reasoning,” a novel task enabling direct generation of pixel-level masks from natural language instructions. To support this, we construct EarthReason, the first large-scale benchmark comprising 5,434 remote sensing images and over 30,000 implicit question-answer pairs. We further propose a domain-adaptive architecture featuring a hierarchical vision encoder, visual token compression, multi-scale language-vision fusion, and description-embedding–driven mask generation. Contribution/Results: Our method achieves state-of-the-art performance on both reasoning-based segmentation and referring segmentation tasks, significantly outperforming conventional methods and existing LLM-driven approaches. This work establishes a new paradigm for remote sensing understanding.

📝 Abstract
Remote sensing has become critical for understanding environmental dynamics, urban planning, and disaster management. However, traditional remote sensing workflows often rely on explicit segmentation or detection methods, which struggle to handle complex, implicit queries that require reasoning over spatial context, domain knowledge, and implicit user intent. Motivated by this, we introduce a new task, i.e., geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region. To advance this task, we construct and release the first large-scale benchmark dataset called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. Moreover, we propose SegEarth-R1, a simple yet effective language-guided segmentation baseline that integrates a hierarchical visual encoder, a large language model (LLM) for instruction parsing, and a tailored mask generator for spatial correlation. The design of SegEarth-R1 incorporates domain-specific adaptations, including aggressive visual token compression to handle ultra-high-resolution remote sensing images, a description projection module to fuse language and multi-scale features, and a streamlined mask prediction pipeline that directly queries description embeddings. Extensive experiments demonstrate that SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods. Our data and code will be released at https://github.com/earth-insights/SegEarth-R1.
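The abstract's "streamlined mask prediction pipeline that directly queries description embeddings" can be pictured as correlating a single text-derived embedding with per-pixel visual features. The following is a minimal NumPy sketch of that idea, not the paper's actual head; the function name, shapes, and thresholding are illustrative assumptions.

```python
import numpy as np

def predict_mask(features, desc_embed, threshold=0.0):
    """Illustrative mask head: dot-product a description embedding
    against per-pixel features to get mask logits, then threshold.
    features: (C, H, W) visual feature map; desc_embed: (C,) vector.
    (Sketch only -- not SegEarth-R1's exact mask generator.)"""
    logits = np.einsum("chw,c->hw", features, desc_embed)
    return logits > threshold

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8, 8))   # toy multi-channel feature map
desc = rng.normal(size=(16,))         # toy description embedding
mask = predict_mask(feats, desc)
print(mask.shape)  # (8, 8) boolean mask
```

The appeal of this design, as the abstract suggests, is that no separate learned query tokens are needed: the description embedding itself acts as the query over the spatial feature map.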
Problem

Research questions and friction points this paper is trying to address.

Geospatial pixel reasoning for complex implicit queries
Handling spatial context and domain knowledge in remote sensing
Generating target region masks from implicit question-answer pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical visual encoder for ultra-high-resolution images
LLM instruction parsing with description projection
Direct mask prediction from description embeddings
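The "aggressive visual token compression" mentioned among the adaptations can be sketched as spatial pooling over the token grid, which shrinks the sequence the LLM must attend over for ultra-high-resolution imagery. Below is a minimal NumPy sketch under that assumption; the function name and the simple average-pooling scheme are illustrative, not the paper's exact mechanism.

```python
import numpy as np

def compress_tokens(tokens, grid, ratio=2):
    """Illustrative token compression: average-pool a (H*W, C) token
    sequence over its 2-D grid, reducing length by ratio**2.
    (Sketch only -- SegEarth-R1's compressor may differ.)"""
    h, w = grid
    c = tokens.shape[1]
    x = tokens.reshape(h, w, c)
    x = x.reshape(h // ratio, ratio, w // ratio, ratio, c).mean(axis=(1, 3))
    return x.reshape(-1, c)

tokens = np.arange(16 * 3, dtype=float).reshape(16, 3)  # 4x4 grid, 3-dim tokens
out = compress_tokens(tokens, (4, 4))
print(out.shape)  # (4, 3): 16 tokens compressed to 4
```

Even this naive 2x2 pooling cuts the token count fourfold, which is why such compression matters when remote sensing tiles yield far more visual tokens than natural images.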
Authors

Kaiyu Li · Wilfrid Laurier University, Canada · Data governance and Data preparation; Data market and Data economy
Zepeng Xin · Xi'an Jiaotong University · Deep Learning
Li Pang · Xi'an Jiaotong University
Chao Pang · Wuhan University
Yupeng Deng · AIRCAS (Aerospace Information Research Institute, Chinese Academy of Sciences)
Jing Yao · Chinese Academy of Sciences
Gui-Song Xia · Wuhan University
Deyu Meng · Professor, Xi'an Jiaotong University · Machine Learning; Applied Mathematics; Computer Vision; Artificial Intelligence
Zhi Wang · Xi'an Jiaotong University
Xiangyong Cao · Xi'an Jiaotong University