Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

📅 2026-06-09
📈 Citations: 0 (influential: 0)
🤖 AI Summary
This work addresses a limitation of existing remote sensing multimodal large language models: they are typically restricted to a few sensor types and tasks, which hinders unified understanding of Earth observation data. The authors propose Earth-OneVision, a 2-billion-parameter autoregressive multimodal large language model that integrates six sensor modalities (optical, SAR, infrared, multispectral, time-series, and video) and supports nine task categories. To tackle multimodal alignment, heterogeneous output unification, and cross-domain adaptation, the model incorporates three mechanisms: Full-Granularity Vision-Language Alignment (FGVLA), Spatial-Linguistic Isomorphic Serialization (SLIS), and Progressive Cross-Modality Adaptation (PCMA). Trained on the MMRS-OneVision dataset of roughly 34 million question-answer pairs, the model matches or outperforms larger models on multiple benchmarks, including OPT-RSVG (P@0.5: 87.52%), SARLANG-Bench (80.68%), BigEarthNet-MS (recall: 75.74%), and EarthMind-Bench (MCQ accuracy: 81.94%).
📝 Abstract
Remote sensing multimodal large language models (RS-MLLMs) enable natural-language understanding and spatial reasoning over Earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the Earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B-parameter RS-MLLM that unifies six sensor modalities (optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across nine task categories within a single autoregressive framework. Three dedicated mechanisms each address one bottleneck. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint gap and the imaging-physics gap in turn. To support joint training, the MMRS-OneVision dataset is constructed with ~34M QA pairs spanning all six sensor modalities and cross-sensor fusion across the nine task categories, substantially exceeding existing RS multimodal instruction datasets in scale. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming 4B-72B RS-MLLMs. It achieves 87.52% P@0.5 on the OPT-RSVG test set for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7%. It further achieves 75.74% recall on the BigEarthNet-MS test set for multispectral classification and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.
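For readers unfamiliar with the grounding metric quoted above, P@0.5 is conventionally the fraction of predicted boxes whose intersection-over-union (IoU) with the ground-truth box reaches 0.5. The following is a minimal sketch of that convention, not the paper's evaluation code. It assumes axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form, one prediction per query, and a `>=` comparison at the threshold; the function names `iou` and `precision_at_iou` are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), axis-aligned.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the paired target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    # Assumption: a prediction counts as a hit when IoU >= threshold.
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    preds = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
    targets = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
    print(precision_at_iou(preds, targets))
```

In this toy input the first prediction matches its target exactly and the second does not overlap its target at all, so one of two predictions counts as a hit. Benchmarks differ on details such as strict versus non-strict thresholds and rotated boxes, so the official protocol should be consulted before comparing numbers.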
Problem

Research questions and friction points this paper is trying to address.

Remote Sensing
Multimodal Large Language Models
Sensor Modalities
Cross-modal Fusion
Earth Observation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal remote sensing
vision-language alignment
cross-modality fusion
autoregressive framework
domain adaptation