AI Summary
This work addresses the performance degradation of open-vocabulary semantic segmentation on remote sensing imagery under adverse weather conditions such as cloud and haze. To mitigate this issue, the authors propose a multimodal fusion framework that integrates optical and synthetic aperture radar (SAR) data. By introducing a cross-modal unified representation alignment mechanism and a dual-encoder fusion module, the method effectively bridges the domain gap between optical and SAR modalities, combining spectral semantics with SAR's penetrative structural information. Furthermore, it leverages multi-level features from vision foundation models and aligns them with textual embeddings to enhance the dense prediction capability of vision-language models. Experimental results demonstrate that the proposed approach significantly improves segmentation robustness and generalization across various cloud and haze conditions, outperforming existing state-of-the-art methods.
Abstract
Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite its great potential in remote sensing, existing work remains largely limited to clear-sky optical data and struggles under cloudy or hazy conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities: optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available here.
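To make the idea of text-aligned multimodal segmentation concrete, the following is a minimal sketch of the general principle only, not MM-OVSeg's actual architecture. It assumes that per-pixel optical and SAR features have already been projected into the same embedding space as the text embeddings, and it substitutes a plain weighted average for the paper's dual-encoder fusion module. All function names and the `optical_weight` parameter are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def fuse_features(optical_feat, sar_feat, optical_weight=0.5):
    """Placeholder fusion: a convex combination of the two modalities.

    Assumption: this stands in for a learned fusion module; it is not
    the method described in the abstract.
    """
    return optical_weight * optical_feat + (1.0 - optical_weight) * sar_feat


def open_vocab_segment(optical_feat, sar_feat, text_emb, optical_weight=0.5):
    """Assign each pixel the class whose text embedding is most similar.

    optical_feat, sar_feat: arrays of shape (H, W, D)
    text_emb: array of shape (K, D), one row per class-name prompt
    Returns an (H, W) array of class indices in [0, K).
    """
    fused = l2_normalize(fuse_features(optical_feat, sar_feat, optical_weight))
    text = l2_normalize(text_emb)
    # Cosine similarity between every pixel and every class prompt: (H, W, K)
    similarity = fused @ text.T
    return similarity.argmax(axis=-1)
```

Because the class list enters only through `text_emb`, new categories can be added at inference time by supplying additional rows. When the optical features carry no usable signal (for example, under heavy cloud), the prediction in this sketch is driven entirely by the SAR features.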