MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing

πŸ“… 2026-03-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the performance degradation of open-vocabulary semantic segmentation on remote sensing imagery under adverse weather conditions such as cloud cover and haze. To mitigate this issue, the authors propose a multimodal fusion framework that integrates optical and synthetic aperture radar (SAR) data. A cross-modal unified representation alignment mechanism and a dual-encoder fusion module bridge the domain gap between the optical and SAR modalities, combining rich spectral semantics with SAR's cloud-penetrating structural information. The method further leverages multi-level features from vision foundation models and aligns them with textual embeddings to strengthen the dense prediction capability of vision-language models. Experimental results show that the approach significantly improves segmentation robustness and generalization across a range of cloud and haze conditions, outperforming existing state-of-the-art methods.
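
As a rough illustration of how such a dual-encoder fusion with text-aligned prediction could be wired up, here is a minimal PyTorch sketch. The class name, the gated per-level fusion, and the pixel-text similarity head are hypothetical stand-ins under stated assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderFusion(nn.Module):
    """Hypothetical sketch: fuse hierarchical optical and SAR features,
    then score each pixel against open-vocabulary text embeddings."""

    def __init__(self, opt_dims, sar_dims, embed_dim=512):
        super().__init__()
        # Project each pyramid level of both modalities into a shared space.
        self.opt_proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in opt_dims)
        self.sar_proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in sar_dims)
        # Simple per-pixel gated fusion (a stand-in for the paper's fusion module).
        self.gate = nn.ModuleList(nn.Conv2d(2 * embed_dim, embed_dim, 1) for _ in opt_dims)

    def forward(self, opt_feats, sar_feats, text_embeds):
        # opt_feats / sar_feats: lists of (B, C_l, H_l, W_l) pyramid features.
        # text_embeds: (K, D) category embeddings from a text encoder (e.g. CLIP).
        fused_levels = []
        target_hw = opt_feats[0].shape[-2:]
        for o, s, po, ps, g in zip(opt_feats, sar_feats,
                                   self.opt_proj, self.sar_proj, self.gate):
            o, s = po(o), ps(s)
            s = F.interpolate(s, size=o.shape[-2:], mode="bilinear", align_corners=False)
            w = torch.sigmoid(g(torch.cat([o, s], dim=1)))  # per-pixel modality weight
            fused = w * o + (1 - w) * s
            fused_levels.append(
                F.interpolate(fused, size=target_hw, mode="bilinear", align_corners=False))
        fused = torch.stack(fused_levels).mean(0)            # (B, D, H, W)

        # Cosine similarity between pixel features and text embeddings -> open-vocab logits.
        fused = F.normalize(fused, dim=1)
        text = F.normalize(text_embeds, dim=-1)
        logits = torch.einsum("bdhw,kd->bkhw", fused, text)  # (B, K, H, W)
        return logits
```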

πŸ“ Abstract
Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite its great potential for remote sensing, progress in this area remains largely limited to clear-sky optical data and degrades under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather. MM-OVSeg leverages the complementary strengths of the two modalities: optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available here.
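
One plausible way to instantiate the cross-modal unification process described in the abstract is a symmetric contrastive alignment between paired optical and SAR representations of the same scene. The sketch below is an assumption-laden illustration: the function name, pooled-feature inputs, and InfoNCE-style objective are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(opt_embeds, sar_embeds, temperature=0.07):
    """Hypothetical sketch: pull optical and SAR embeddings of the same scene
    together with a symmetric InfoNCE objective (one possible instantiation
    of a cross-modal unification step, not the paper's exact loss)."""
    # opt_embeds, sar_embeds: (B, D) pooled features from the two encoders.
    opt = F.normalize(opt_embeds, dim=-1)
    sar = F.normalize(sar_embeds, dim=-1)
    logits = opt @ sar.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(opt.size(0), device=opt.device)
    loss_o2s = F.cross_entropy(logits, targets)    # optical -> SAR direction
    loss_s2o = F.cross_entropy(logits.t(), targets)  # SAR -> optical direction
    return 0.5 * (loss_o2s + loss_s2o)
```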
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary segmentation
remote sensing
adverse weather conditions
multimodal fusion
cloud contamination
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal fusion
open-vocabulary segmentation
Optical-SAR alignment
vision-language model
remote sensing