EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery

📅 2025-04-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing remote sensing multimodal large language models (RS MLLMs) suffer from coarse-grained interpretation, limited interaction modalities, and fragmented spatial task modeling when processing heterogeneous imagery (optical, SAR, and infrared). To address these limitations, we propose EarthGPT-X, presented as the first RS MLLM to enable unified understanding and interactive reasoning over multi-source remote sensing imagery. It introduces a cross-domain one-stage fusion training paradigm and a pixel perception module that jointly models the referring and grounding tasks within a unified visual prompting framework. EarthGPT-X also supports multi-granularity spatial reasoning, from scene level down to pixel level. Experiments show that EarthGPT-X outperforms state-of-the-art methods on a range of multi-granularity remote sensing understanding benchmarks, with strong cross-modal generalization and flexible spatial interaction, advancing RS MLLMs toward practical deployment.

📝 Abstract
Recent advances in vision-language research have produced multi-modal large language models (MLLMs) for natural images that perform spatial reasoning through visual prompting. However, remote sensing (RS) imagery contains abundant geospatial information that differs from natural images, which makes it challenging to adapt these natural-image spatial models to the RS domain. Moreover, current RS MLLMs are limited to narrow interpretation levels and interaction modes, which hinders their applicability in real-world scenarios. To address these challenges, a spatial MLLM named EarthGPT-X is proposed, enabling comprehensive understanding of multi-source RS imagery, such as optical, synthetic aperture radar (SAR), and infrared. EarthGPT-X offers zoom-in and zoom-out insight and possesses flexible multi-grained interactive abilities. It also unifies two critical types of spatial tasks (i.e., referring and grounding) in a visual prompting framework. To achieve these versatile capabilities, several key strategies are developed. The first is a multi-modal content integration method, which enhances the interplay between images, visual prompts, and text instructions. Second, a cross-domain one-stage fusion training strategy is proposed, using the large language model (LLM) as a unified interface for multi-source multi-task learning. Third, a pixel perception module is incorporated so that the referring and grounding tasks are seamlessly unified within a single framework. Experiments demonstrate the superiority of EarthGPT-X in multi-grained tasks and its flexibility in multi-modal interaction, indicating significant advances for MLLMs in the RS field.

Problem

Research questions and friction points this paper is trying to address.

Adapting natural-image spatial models to the remote sensing domain
Overcoming narrow interpretation levels in current RS MLLMs
Unifying referring and grounding tasks in a visual prompting framework

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal content integration for RS imagery
Cross-domain one-stage fusion training strategy
Pixel perception module unifies referring and grounding
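
To make the idea of unifying referring and grounding more concrete, here is a minimal sketch of how both tasks could share one sample schema and one prompt format. It is an illustration only, not EarthGPT-X's actual data format: the `SpatialSample` class, the `build_pair` helper, and the `<image>` and `<box>` tokens are all hypothetical. It relies only on the usual meaning of the two tasks: in referring, a region is given as a visual prompt and the model answers in text; in grounding, text is given and the model returns a region. Pixel-level masks and the pixel perception module are not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class SpatialSample:
    """One instruction-tuning sample in a hypothetical unified schema."""
    modality: str           # e.g. "optical", "sar", or "infrared"
    task: str               # "referring" or "grounding"
    instruction: str        # text instruction given to the model
    region: Optional[Box]   # input prompt for referring, target for grounding
    answer: Optional[str]   # text target for referring, unused for grounding


def format_box(box: Box) -> str:
    """Serialize a box as text so it can sit in a prompt or a target."""
    return "<box>" + ",".join(f"{v:.2f}" for v in box) + "</box>"


def build_pair(sample: SpatialSample) -> Tuple[str, str]:
    """Return (prompt, target) text for either task using one shared format."""
    header = f"<image> [{sample.modality}] {sample.instruction}"
    if sample.task == "referring":
        # Region is part of the input; the target is a textual answer.
        if sample.region is None or sample.answer is None:
            raise ValueError("referring needs both a region and an answer")
        return f"{header} {format_box(sample.region)}", sample.answer
    if sample.task == "grounding":
        # Text alone is the input; the target is the serialized region.
        if sample.region is None:
            raise ValueError("grounding needs a target region")
        return header, format_box(sample.region)
    raise ValueError(f"unknown task: {sample.task}")
```

Because both tasks reduce to text-in, text-out pairs in this sketch, samples from different sensors and tasks could be mixed in a single training stream, which is the general idea behind using the LLM as a unified interface.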

Authors
Wei Zhang
Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology, Beijing 100081, China, and School of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China
Miaoxin Cai
Beijing Institute of Technology
remote sensing MLLM
Yaqian Ning
School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China
Tong Zhang
National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing, Beijing Institute of Technology, Beijing 100081, China
Zhuang Yin
National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing, Beijing Institute of Technology, Beijing 100081, China
He Chen
Chinese University of Hong Kong
Mathematical Programming
Jun Li
School of Computer Science and Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, 430078, China
Xuerui Mao
Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology, Beijing 100081, China, and School of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China, and Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing 314003, China