MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing

πŸ“… 2026-03-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the performance degradation of open-vocabulary semantic segmentation on remote sensing imagery under adverse weather conditions such as cloud cover and haze. To mitigate this issue, the authors propose a multimodal fusion framework that integrates optical and synthetic aperture radar (SAR) data. A cross-modal unified representation alignment mechanism and a dual-encoder fusion module bridge the domain gap between the optical and SAR modalities, combining rich spectral semantics with SAR's cloud-penetrating structural information. The method further leverages multi-level features from vision foundation models and aligns them with textual embeddings to strengthen the dense prediction capability of vision-language models. Experimental results show that the approach significantly improves segmentation robustness and generalization across a range of cloud and haze conditions, outperforming existing state-of-the-art methods.
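
As a rough illustration of how such a dual-encoder fusion with text-aligned prediction could be wired up, here is a minimal PyTorch sketch. The class name, the gated per-level fusion, and the pixel-text similarity head are hypothetical stand-ins under stated assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderFusion(nn.Module):
    """Hypothetical sketch: fuse hierarchical optical and SAR features,
    then score each pixel against open-vocabulary text embeddings."""

    def __init__(self, opt_dims, sar_dims, embed_dim=512):
        super().__init__()
        # Project each pyramid level of both modalities into a shared space.
        self.opt_proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in opt_dims)
        self.sar_proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in sar_dims)
        # Simple per-pixel gated fusion (a stand-in for the paper's fusion module).
        self.gate = nn.ModuleList(nn.Conv2d(2 * embed_dim, embed_dim, 1) for _ in opt_dims)

    def forward(self, opt_feats, sar_feats, text_embeds):
        # opt_feats / sar_feats: lists of (B, C_l, H_l, W_l) pyramid features.
        # text_embeds: (K, D) category embeddings from a text encoder (e.g. CLIP).
        fused_levels = []
        target_hw = opt_feats[0].shape[-2:]
        for o, s, po, ps, g in zip(opt_feats, sar_feats,
                                   self.opt_proj, self.sar_proj, self.gate):
            o, s = po(o), ps(s)
            s = F.interpolate(s, size=o.shape[-2:], mode="bilinear", align_corners=False)
            w = torch.sigmoid(g(torch.cat([o, s], dim=1)))  # per-pixel modality weight
            fused = w * o + (1 - w) * s
            fused_levels.append(
                F.interpolate(fused, size=target_hw, mode="bilinear", align_corners=False))
        fused = torch.stack(fused_levels).mean(0)            # (B, D, H, W)

        # Cosine similarity between pixel features and text embeddings -> open-vocab logits.
        fused = F.normalize(fused, dim=1)
        text = F.normalize(text_embeds, dim=-1)
        logits = torch.einsum("bdhw,kd->bkhw", fused, text)  # (B, K, H, W)
        return logits
```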

πŸ“ Abstract
Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite its great potential for remote sensing, progress in this area remains largely limited to clear-sky optical data and degrades under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather. MM-OVSeg leverages the complementary strengths of the two modalities: optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available here.
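
One plausible way to instantiate the cross-modal unification process described in the abstract is a symmetric contrastive alignment between paired optical and SAR representations of the same scene. The sketch below is an assumption-laden illustration: the function name, pooled-feature inputs, and InfoNCE-style objective are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(opt_embeds, sar_embeds, temperature=0.07):
    """Hypothetical sketch: pull optical and SAR embeddings of the same scene
    together with a symmetric InfoNCE objective (one possible instantiation
    of a cross-modal unification step, not the paper's exact loss)."""
    # opt_embeds, sar_embeds: (B, D) pooled features from the two encoders.
    opt = F.normalize(opt_embeds, dim=-1)
    sar = F.normalize(sar_embeds, dim=-1)
    logits = opt @ sar.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(opt.size(0), device=opt.device)
    loss_o2s = F.cross_entropy(logits, targets)    # optical -> SAR direction
    loss_s2o = F.cross_entropy(logits.t(), targets)  # SAR -> optical direction
    return 0.5 * (loss_o2s + loss_s2o)
```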
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary segmentation
remote sensing
adverse weather conditions
multimodal fusion
cloud contamination
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal fusion
open-vocabulary segmentation
Optical-SAR alignment
vision-language model
remote sensing