DFR: A Decompose-Fuse-Reconstruct Framework for Multi-Modal Few-Shot Segmentation

📅 2025-07-22
🤖 AI Summary
To address the limitations of single- or dual-modal guidance and the insufficient modeling of complex semantics in few-shot segmentation (FSS), this paper proposes DFR (Decompose, Fuse and Reconstruct). DFR introduces a novel tri-modal (vision, text, audio) decomposition-fusion-reconstruction mechanism: it leverages SAM to generate visual region proposals, combines hierarchical semantic expansion of text with audio feature extraction, employs a contrastive cross-modal fusion module for semantic alignment, and adopts a dual-path reconstruction architecture that jointly optimizes geometric and semantic consistency. Evaluated on both synthetic and real-world benchmarks, DFR significantly outperforms state-of-the-art methods. The results indicate that dynamic multimodal interaction is critical to the robustness and generalization of few-shot segmentation, particularly when labeled support samples are scarce.

📝 Abstract
This paper presents DFR (Decompose, Fuse and Reconstruct), a novel framework that addresses the fundamental challenge of effectively utilizing multi-modal guidance in few-shot segmentation (FSS). While existing approaches primarily rely on visual support samples or textual descriptions, their single or dual-modal paradigms limit exploitation of rich perceptual information available in real-world scenarios. To overcome this limitation, the proposed approach leverages the Segment Anything Model (SAM) to systematically integrate visual, textual, and audio modalities for enhanced semantic understanding. The DFR framework introduces three key innovations: 1) Multi-modal Decompose: a hierarchical decomposition scheme that extracts visual region proposals via SAM, expands textual semantics into fine-grained descriptors, and processes audio features for contextual enrichment; 2) Multi-modal Contrastive Fuse: a fusion strategy employing contrastive learning to maintain consistency across visual, textual, and audio modalities while enabling dynamic semantic interactions between foreground and background features; 3) Dual-path Reconstruct: an adaptive integration mechanism combining semantic guidance from tri-modal fused tokens with geometric cues from multi-modal location priors. Extensive experiments across visual, textual, and audio modalities under both synthetic and real settings demonstrate DFR's substantial performance improvements over state-of-the-art methods.
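The three stages named in the abstract can be summarized as a control-flow sketch. Everything below is an illustrative stand-in under assumed shapes (random region splits, mean pooling, a random location prior), not the authors' actual modules or API; it only shows how decompose, fuse, and reconstruct hand data to one another.

```python
import numpy as np

class DFRSketch:
    """Toy control-flow sketch of the decompose-fuse-reconstruct pipeline.
    Every component is a hypothetical stand-in, not the paper's method."""

    def decompose(self, image, text_tokens, audio):
        # Stand-ins for: SAM region proposals, textual semantic expansion,
        # and an audio feature extractor.
        pixels = image.reshape(-1, image.shape[-1])      # (H*W, C)
        regions = np.array_split(pixels, 4)              # fake "region proposals"
        descriptors = [t.lower() for t in text_tokens]   # fake fine-grained descriptors
        audio_feat = audio.mean(axis=0)                  # (C,) pooled audio feature
        return regions, descriptors, audio_feat

    def fuse(self, regions, descriptors, audio_feat):
        # Stand-in for contrastive cross-modal fusion: pool each modality
        # into a shared C-dim space and combine into one tri-modal token.
        vis_tok = np.stack([r.mean(axis=0) for r in regions]).mean(axis=0)
        txt_tok = np.full_like(audio_feat, float(len(descriptors)))
        return vis_tok + txt_tok + audio_feat

    def reconstruct(self, fused_token, location_prior):
        # Dual paths: semantic guidance from the fused token plus a
        # geometric cue from a location prior (here a random heatmap).
        heat = location_prior * fused_token.mean()
        return (heat > heat.mean()).astype(np.uint8)     # predicted binary mask

rng = np.random.default_rng(0)
model = DFRSketch()
regions, desc, aud = model.decompose(rng.random((16, 16, 3)),
                                     ["Dog", "Fur", "Bark"],
                                     rng.random((10, 3)))
mask = model.reconstruct(model.fuse(regions, desc, aud), rng.random((16, 16)))
```

The only structural point the sketch preserves is that reconstruction consumes both a fused semantic token and a separate geometric prior, mirroring the dual-path design.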
Problem

Research questions and friction points this paper is trying to address.

Effectively utilizing multi-modal guidance in few-shot segmentation
Overcoming limitations of single or dual-modal paradigms in segmentation
Integrating visual, textual, and audio modalities for semantic understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical decomposition of multi-modal data
Contrastive fusion for cross-modal consistency
Dual-path reconstruction with adaptive integration
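The "contrastive fusion for cross-modal consistency" idea can be illustrated with a standard symmetric InfoNCE loss applied pairwise across modalities. This is a generic sketch of that family of losses, not the paper's actual objective; the feature matrices, dimensions, and the pairwise averaging over (vision, text, audio) are all assumptions.

```python
import numpy as np

def l2_normalize(x):
    # Unit-normalize feature rows so dot products are cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cross_modal_contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss: matched pairs (a[i], b[i]) are pulled
    together, mismatched pairs pushed apart (generic, CLIP-style)."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature                # (N, N) similarity matrix
    n = len(a)
    def ce(l):                                    # cross-entropy, diagonal targets
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()
    return 0.5 * (ce(logits) + ce(logits.T))      # both retrieval directions

# Toy tri-modal support features for N=4 episodes (random stand-ins)
rng = np.random.default_rng(0)
vis = rng.normal(size=(4, 16))
txt = rng.normal(size=(4, 16))
aud = rng.normal(size=(4, 16))
loss = (cross_modal_contrastive_loss(vis, txt)
        + cross_modal_contrastive_loss(vis, aud)
        + cross_modal_contrastive_loss(txt, aud)) / 3
```

Averaging the loss over all three modality pairs is one plausible way to enforce the tri-modal consistency the Innovation list describes; the paper may weight or structure the pairs differently.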
Shuai Chen
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China
Fanman Meng
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China
Xiwei Zhang
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China
Haoran Wei
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China
Chenhao Wu
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China
Qingbo Wu
University of Electronic Science and Technology of China
video coding, image and video quality assessment
Hongliang Li
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China