RIS-FUSION: Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-driven infrared–visible image fusion methods lack effective supervision and evaluation of textual guidance. To address this, we propose RIS-FUSION, the first framework to jointly optimize referring image segmentation (RIS) and text-driven fusion within a cascaded cross-modal architecture. We introduce the LangGatedFusion module to enable fine-grained injection of textual features into the fusion process. To support this research, we release MM-RIS, the first large-scale multimodal benchmark for referring infrared–visible fusion, comprising diverse real-world scenes with pixel-level segmentation masks and natural-language referring expressions. By explicitly constraining fusion via segmentation masks, our method significantly enhances semantic alignment with target objects and accentuates salient regions. On MM-RIS, RIS-FUSION achieves a mean Intersection-over-Union (mIoU) 11.2 percentage points higher than current state-of-the-art methods, demonstrating the effectiveness of cross-modal text–image alignment and segmentation-guided fusion.
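The summary does not spell out how LangGatedFusion injects text into the fusion backbone. Below is a minimal PyTorch sketch of what such a language-gated block could look like; only the module's name and purpose come from the paper, while the sigmoid-gate design, dimensions, and every identifier are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LangGatedFusionSketch(nn.Module):
    """Hypothetical sketch of a language-gated fusion block.

    Only the name and intent come from the paper; the concrete gating
    design below (a per-channel sigmoid gate computed from a pooled
    text embedding) is an illustrative assumption.
    """

    def __init__(self, img_dim: int, txt_dim: int):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, img_dim)  # project text into image-channel space
        self.gate = nn.Sequential(nn.Linear(img_dim, img_dim), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * img_dim, img_dim, kernel_size=3, padding=1)

    def forward(self, ir_feat, vis_feat, txt_emb):
        # ir_feat, vis_feat: (B, C, H, W) features from the two modalities
        # txt_emb: (B, txt_dim) pooled embedding of the referring expression
        fused = self.fuse(torch.cat([ir_feat, vis_feat], dim=1))
        g = self.gate(self.txt_proj(txt_emb))   # (B, C) per-channel gate
        g = g.unsqueeze(-1).unsqueeze(-1)       # broadcast over H and W
        # emphasize channels relevant to the referred object via a gated residual
        return fused + fused * g
```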

📝 Abstract
Text-driven infrared and visible image fusion has gained attention for enabling natural language to guide the fusion process. However, existing methods lack a goal-aligned task to supervise and evaluate how effectively the input text contributes to the fusion outcome. We observe that referring image segmentation (RIS) and text-driven fusion share a common objective: highlighting the object referred to by the text. Motivated by this, we propose RIS-FUSION, a cascaded framework that unifies fusion and RIS through joint optimization. At its core is the LangGatedFusion module, which injects textual features into the fusion backbone to enhance semantic alignment. To support the multimodal referring image segmentation task, we introduce MM-RIS, a large-scale benchmark with 12.5k training and 3.5k testing triplets, each consisting of an infrared-visible image pair, a segmentation mask, and a referring expression. Extensive experiments show that RIS-FUSION achieves state-of-the-art performance, outperforming existing methods by over 11% in mIoU. Code and dataset will be released at https://github.com/SijuMa2003/RIS-FUSION.
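The abstract's central idea, that RIS supervision gives fusion a goal-aligned training signal, can be made concrete as a joint objective in which the segmentation loss on the fused output backpropagates into the fusion network. The sketch below is an assumption about how such a cascaded loss might be wired; the specific loss terms (a max-intensity L1 fusion term is a common baseline in this literature, not necessarily the paper's choice) and the weight `lam` are not from the paper.

```python
import torch
import torch.nn.functional as F

def joint_loss(fused, ir, vis, seg_logits, mask, lam=1.0):
    """Hypothetical joint objective for cascaded fusion + RIS.

    The paper states that segmentation supervises fusion; the exact
    terms and weighting below are assumptions for illustration.
    """
    # fusion term: keep the fused image close to the stronger source signal
    l_fusion = F.l1_loss(fused, torch.maximum(ir, vis))
    # segmentation term: supervise the RIS head on the referred object
    l_seg = F.binary_cross_entropy_with_logits(seg_logits, mask)
    # gradients from l_seg flow back through the fused image into the
    # fusion network, providing the goal-aligned supervision signal
    return l_fusion + lam * l_seg
```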
Problem

Research questions and friction points this paper is trying to address.

Lack of goal-aligned supervision and evaluation for text-driven image fusion
No unified framework jointly optimizing image fusion and referring segmentation
No large-scale benchmark for multimodal infrared-visible segmentation with text guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascaded framework unifies fusion and segmentation
LangGatedFusion module injects text features
Large-scale MM-RIS benchmark dataset created (triplet layout sketched after this list)
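To make the benchmark's triplet structure concrete, here is a minimal PyTorch `Dataset` sketch. The triplet fields (infrared image, visible image, segmentation mask, referring expression) come from the abstract; the on-disk layout, file names, and JSON schema are hypothetical.

```python
import json
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class MMRISTriplets(Dataset):
    """Sketch of an MM-RIS-style triplet loader.

    The triplet structure is described in the paper's abstract; the
    directory layout and field names below are assumptions.
    """

    def __init__(self, root: str, split: str = "train"):
        self.root = Path(root)
        # hypothetical index file: one JSON record per triplet
        with open(self.root / f"{split}.json") as f:
            self.records = json.load(f)

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        r = self.records[i]
        ir = Image.open(self.root / r["ir"]).convert("L")
        vis = Image.open(self.root / r["vis"]).convert("RGB")
        mask = Image.open(self.root / r["mask"]).convert("L")
        return ir, vis, mask, r["expression"]  # referring expression string
```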
Siju Ma
Central University of Finance and Economics, Beijing, China
Changsiyu Gong
Central University of Finance and Economics, Beijing, China
Xiaofeng Fan
Central University of Finance and Economics, Beijing, China
Yong Ma
Wuhan University
Infrared image processing · remote sensing
Chengjie Jiang
Tsinghua University, Shenzhen, China