🤖 AI Summary
Existing remote sensing multimodal large language models (RS-MLLMs) are constrained to a narrow range of sensor types and tasks, which hinders unified understanding of Earth observation data. The authors propose Earth-OneVision, a 2-billion-parameter autoregressive RS-MLLM that integrates six sensor modalities (optical, SAR, infrared, multispectral, temporal, and video) plus cross-sensor fusion, and supports nine task categories. Three mechanisms target three bottlenecks: Full-Granularity Vision-Language Alignment (FGVLA) for multimodal alignment, Spatial-Linguistic Isomorphic Serialization (SLIS) for unifying heterogeneous spatial outputs, and Progressive Cross-Modality Adaptation (PCMA) for cross-domain adaptation. Trained on the MMRS-OneVision dataset of roughly 34 million question-answer pairs, the model matches or outperforms 4B-72B RS-MLLMs across multiple benchmarks, including OPT-RSVG (P@0.5: 87.52%), SARLANG-Bench (80.68%), BigEarthNet-MS (recall: 75.74%), and EarthMind-Bench (MCQ accuracy: 81.94%).
📝 Abstract
Remote sensing multimodal large language models (RS-MLLMs) enable natural-language understanding and spatial reasoning over Earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the Earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B-parameter RS-MLLM that unifies six sensor modalities (i.e., optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across 9 task categories within a single autoregressive framework. Three dedicated mechanisms address three bottlenecks. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint and imaging physics gaps in turn. To support joint training, MMRS-OneVision is constructed with ~34M QA pairs spanning all six sensor modalities and cross-sensor fusion across 9 task categories, substantially exceeding existing RS multimodal instruction datasets in scale. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming 4B-72B RS-MLLMs. It achieves 87.52% P@0.5 on the OPT-RSVG test set for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7%. It further achieves 75.74% recall on the BigEarthNet-MS test set for multispectral classification, and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.
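For readers unfamiliar with the grounding metric quoted above, P@0.5 is commonly computed as the fraction of predicted boxes whose intersection-over-union (IoU) with the ground-truth box reaches 0.5. The sketch below illustrates that common definition only. It is not the paper's evaluation code, and the box format `(x_min, y_min, x_max, y_max)`, the inclusive `>=` comparison, and the function names `iou` and `precision_at_iou` are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions with IoU >= threshold against the paired ground truth.

    Assumes one predicted box per query, paired by position with one
    ground-truth box (the usual visual grounding setup).
    """
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) >= threshold)
    return hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 10, 10), (0, 0, 10, 10)]
    ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
    # First pair overlaps fully, second pair not at all.
    print(precision_at_iou(predictions, ground_truth))
```

Under this definition, a reported 87.52% P@0.5 would mean that roughly seven in eight predicted boxes overlap their target with IoU of at least 0.5. Benchmarks differ on whether the threshold comparison is strict, so the official evaluation script remains the authority.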