TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models struggle to perform complex spatial reasoning in remote sensing tasks due to their inability to leverage precise pixel-level representations. To address this limitation, this work proposes TerraScope, the first pixel-level Earth observation vision-language model that supports modality flexibility (optical/SAR), multi-temporal inputs, and adaptive fusion. TerraScope introduces a chain-of-thought reasoning mechanism grounded in pixel-level mask embeddings to enable interpretable spatial inference. Additionally, the authors construct Terra-CoT, a large-scale dataset comprising one million samples, and TerraScope-Bench, the first pixel-level remote sensing reasoning benchmark. Experiments demonstrate that TerraScope significantly outperforms existing methods on tasks such as change analysis, achieving both high answer accuracy and high-quality mask generation, thereby validating its genuine pixel-level reasoning capability.

📝 Abstract
Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
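The abstract's first capability, modality-flexible reasoning, can be illustrated with a toy sketch: accept a single modality (optical or SAR) as-is, and gate-combine the two embeddings when both are available. This is a hypothetical illustration of the general idea, not TerraScope's actual fusion architecture; the function name `adaptive_fuse`, the scalar softmax gate, and the `w_gate` parameter are all assumptions made for the example.

```python
import numpy as np

def adaptive_fuse(optical=None, sar=None, w_gate=None):
    """Toy modality-flexible fusion (illustrative only, not the paper's
    method): pass a single modality through unchanged, or mix optical
    and SAR embeddings with softmax gates when both are present."""
    if optical is None and sar is None:
        raise ValueError("at least one modality is required")
    if sar is None:
        return optical
    if optical is None:
        return sar
    # With both modalities, derive a scalar score per modality from its
    # mean activation (a stand-in for a learned gating network).
    if w_gate is None:
        w_gate = np.ones(2)
    scores = w_gate * np.array([optical.mean(), sar.mean()])
    gates = np.exp(scores - scores.max())  # numerically stable softmax
    gates /= gates.sum()
    return gates[0] * optical + gates[1] * sar
```

The same interface covers all three input regimes the abstract lists, so downstream reasoning code never needs to branch on which sensors were available.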
Problem

Research questions and friction points this paper is trying to address.

pixel-grounded reasoning
earth observation
geospatial reasoning
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

pixel-grounded reasoning
vision-language model
multi-temporal analysis
modality fusion
geospatial reasoning