Self-Disentanglement and Re-Composition for Cross-Domain Few-Shot Segmentation

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
In cross-domain few-shot segmentation (CD-FSS), mask prediction typically relies on comparing feature distances between samples. When a Vision Transformer (ViT) is decomposed into its components, this distance calculation turns out to cross-compare every component of one image with every component of the other at equal importance, so meaningful comparisons are entangled with meaningless ones and source-domain patterns become hard to transfer. The paper identifies and interprets this entanglement, then proposes a “Self-Disentanglement and Re-Composition” approach that learns a weight for each comparison between ViT components, yielding disentangled features that are re-composed for the CD-FSS task. This benefits both generalization and finetuning. The model outperforms the state-of-the-art CD-FSS method by 1.92% (1-shot) and 1.88% (5-shot) in average accuracy.

📝 Abstract
Cross-Domain Few-Shot Segmentation (CD-FSS) aims to transfer knowledge from a source-domain dataset to unseen target-domain datasets with limited annotations. Current methods typically compare the distance between training and testing samples for mask prediction. However, we find that this widely adopted approach suffers from an entanglement problem: it tends to bind source-domain patterns together, making each of them hard to transfer. In this paper, we aim to address this problem for the CD-FSS task. We first find a natural decomposition of the ViT structure, based on which we delve into the entanglement problem for an interpretation. We find that the decomposed ViT components are cross-compared between images in the distance calculation, where the rational comparisons are entangled with the meaningless ones because all comparisons are given equal importance, leading to the entanglement problem. Based on this interpretation, we further propose to address the entanglement problem by learning to weigh all comparisons of ViT components, which learns disentangled features and re-composes them for the CD-FSS task, benefiting both generalization and finetuning. Experiments show that our model outperforms the state-of-the-art CD-FSS method by 1.92% and 1.88% in average accuracy under 1-shot and 5-shot settings, respectively.
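The equal-importance cross-comparison described in the abstract can be illustrated with a small numerical sketch. It is not the authors' implementation. It assumes, for illustration only, that an image feature is the sum of per-component contributions (as in a residual network) and that similarity is a plain dot product; the abstract specifies neither. The function names and the weight matrices are hypothetical, and the weights are fixed by hand here, whereas the paper learns them.

```python
import numpy as np

def pairwise_component_similarity(support_parts, query_parts):
    """Return the (n, n) matrix of dot products between every
    support component (rows) and every query component (columns)."""
    return support_parts @ query_parts.T

def weighted_similarity(support_parts, query_parts, weights):
    """Re-compose one similarity score from all pairwise comparisons,
    scaling each comparison by its entry in `weights` (hypothetical)."""
    pairwise = pairwise_component_similarity(support_parts, query_parts)
    return float(np.sum(weights * pairwise))

# Toy components: 4 components per image, each an 8-dimensional vector.
rng = np.random.default_rng(0)
n_components, dim = 4, 8
support_parts = rng.normal(size=(n_components, dim))
query_parts = rng.normal(size=(n_components, dim))

# Comparing the summed features implicitly compares every pair of
# components with weight 1 (the "entangled" case under our assumption).
entangled = float(support_parts.sum(axis=0) @ query_parts.sum(axis=0))
uniform = weighted_similarity(
    support_parts, query_parts, np.ones((n_components, n_components))
)

# A non-uniform weight matrix can suppress some comparisons. The identity
# keeps only same-index pairs; a learned matrix would replace it.
matched_only = weighted_similarity(
    support_parts, query_parts, np.eye(n_components)
)
```

Under these assumptions, `entangled` and `uniform` agree because the dot product is bilinear: the similarity of two sums expands into the sum of all pairwise similarities. Replacing the all-ones matrix with a trainable one is one way to read "learning to weigh all comparisons".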
Problem

Research questions and friction points this paper is trying to address.

Addressing entanglement in cross-domain few-shot segmentation
Disentangling and re-composing ViT components for better transfer
Improving generalization and finetuning in CD-FSS tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decompose ViT structure for interpretation
Learn to weigh ViT component comparisons
Disentangle and re-compose features for CD-FSS
Jintao Tong
Huazhong University of Science and Technology
large multimodal model, few-shot learning
Yixiong Zou
Huazhong University of Science and Technology
Computer vision, Domain generalization, Few-shot learning, Vision-language model
Guangyao Chen
Cornell University
Open-world Learning, Autonomous Agent, AI for Science
Yuhua Li
School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Ruixuan Li
School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China