ConInfer: Context-Aware Inference for Training-Free Open-Vocabulary Remote Sensing Segmentation

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing training-free open-vocabulary remote sensing segmentation methods suffer from inconsistent predictions and limited generalization because they rely on independent inference strategies that neglect the strong spatial and semantic correlations within remote sensing images. To address this, the paper proposes the first training-free, context-aware inference framework, which explicitly models semantic dependencies through a joint inference mechanism across spatial regions and integrates global contextual information to improve segmentation consistency and robustness. Built on vision-language models, the proposed method significantly outperforms current state-of-the-art approaches across multiple benchmarks, achieving average gains of 2.80% on open-vocabulary semantic segmentation and 6.13% on object extraction tasks.
📝 Abstract
Training-free open-vocabulary remote sensing segmentation (OVRSS), empowered by vision-language models, has emerged as a promising paradigm for achieving category-agnostic semantic understanding in remote sensing imagery. Existing approaches mainly focus on enhancing feature representations or mitigating modality discrepancies to improve patch-level prediction accuracy. However, such independent prediction schemes are fundamentally misaligned with the intrinsic characteristics of remote sensing data. In real-world applications, remote sensing scenes are typically large-scale and exhibit strong spatial as well as semantic correlations, making isolated patch-wise predictions insufficient for accurate segmentation. To address this limitation, we propose ConInfer, a context-aware inference framework for OVRSS that performs joint prediction across multiple spatial units while explicitly modeling their inter-unit semantic dependencies. By incorporating global contextual cues, our method significantly enhances segmentation consistency, robustness, and generalization in complex remote sensing environments. Extensive experiments on multiple benchmark datasets demonstrate that our approach consistently surpasses state-of-the-art per-pixel VLM-based baselines such as SegEarth-OV, achieving average improvements of 2.80% and 6.13% on open-vocabulary semantic segmentation and object extraction tasks, respectively. The implementation code is available at: https://github.com/Dog-Yang/ConInfer
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary remote sensing segmentation
context-aware inference
spatial correlation
semantic dependency
training-free segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

context-aware inference
open-vocabulary segmentation
remote sensing
vision-language models
spatial-semantic correlation
Wenyang Chen
Yunnan Normal University
Zhanxuan Hu
Yunnan Normal University
Yaping Zhang
Yunnan Normal University
Hailong Ning
University of Illinois Urbana Champaign
Yonghang Tai
Deakin University