🤖 AI Summary
Existing remote sensing instance segmentation methods are largely confined to closed-vocabulary settings, exhibiting limited generalization to novel categories or cross-dataset scenarios. To address this, we propose an open-vocabulary remote sensing instance segmentation framework that innovatively incorporates multi-granularity scene context modeling: region-aware fusion enhances local object discriminability, global context adaptation improves vision-language alignment, and a CLIP-driven open-vocabulary learning mechanism enables zero-shot category recognition. This work is the first to systematically integrate deep scene contextual reasoning into the open-vocabulary segmentation paradigm, effectively mitigating challenges posed by complex terrain, seasonal variations, and small-scale objects. Extensive experiments demonstrate state-of-the-art performance across multiple remote sensing benchmarks—including RSIS, RSISS, and RSOD—while significantly improving generalizability and practical utility for large-scale geospatial analysis.
📝 Abstract
Most existing remote sensing instance segmentation approaches are designed for closed-vocabulary prediction, limiting their ability to recognize novel categories or generalize across datasets. This restricts their applicability in diverse Earth observation scenarios. To address this, we introduce open-vocabulary (OV) learning for remote sensing instance segmentation. While current OV segmentation models perform well on natural image datasets, their direct application to remote sensing faces challenges such as diverse landscapes, seasonal variations, and the presence of small or ambiguous objects in aerial imagery. To overcome these challenges, we propose $\textbf{SCORE}$ ($\textbf{S}$cene $\textbf{C}$ontext matters in $\textbf{O}$pen-vocabulary $\textbf{RE}$mote sensing instance segmentation), a framework that integrates multi-granularity scene context, i.e., regional context and global context, to enhance both visual and textual representations. Specifically, we introduce Region-Aware Integration, which refines class embeddings with regional context to improve object distinguishability. Additionally, we propose Global Context Adaptation, which enriches naive text embeddings with remote sensing global context, creating a more adaptable and expressive linguistic latent space for the classifier. We establish new benchmarks for OV remote sensing instance segmentation across diverse datasets. Experimental results demonstrate that our proposed method achieves state-of-the-art performance, providing a robust solution for large-scale, real-world geospatial analysis. Our code is available at https://github.com/HuangShiqi128/SCORE.
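The abstract describes the two modules only at a high level. As a loose illustration (not the paper's actual implementation), Region-Aware Integration can be read as attention-style fusion of class embeddings with region features, and Global Context Adaptation as a scene-level shift of text embeddings followed by re-normalization for CLIP-style cosine matching. All function names, shapes, and the residual/additive fusion choices below are hypothetical stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def region_aware_integration(class_emb, region_feats):
    """Sketch: refine each class embedding with regional context via
    scaled dot-product attention over region features (hypothetical)."""
    d = class_emb.shape[-1]
    attn = softmax(class_emb @ region_feats.T / np.sqrt(d))  # (C, R) weights
    context = attn @ region_feats                            # (C, D) pooled context
    return class_emb + context                               # residual fusion

def global_context_adaptation(text_emb, global_vec):
    """Sketch: shift naive text embeddings toward a scene-level global
    context vector, then re-normalize for cosine-similarity matching."""
    adapted = text_emb + global_vec[None, :]
    return adapted / np.linalg.norm(adapted, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
C, R, D = 5, 8, 16  # classes, regions, embedding dim (illustrative sizes)
cls = rng.standard_normal((C, D))
regions = rng.standard_normal((R, D))
g = rng.standard_normal(D)

refined = region_aware_integration(cls, regions)
adapted = global_context_adaptation(refined, g)
print(refined.shape, adapted.shape)  # (5, 16) (5, 16)
```

The re-normalization step matters because CLIP-style classifiers score regions by cosine similarity, so any additive context shift must be projected back onto the unit sphere before matching.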