Unbiased Semantic Decoding With Vision Foundation Models for Few-Shot Segmentation

📅 2025-10-06
🏛️ IEEE Transactions on Neural Networks and Learning Systems
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address class bias in few-shot segmentation caused by support-set-dependent prompting, this paper proposes an unbiased semantic decoding framework—the first to achieve class-agnostic decoding without fine-tuning vision foundation models (e.g., SAM). Our method synergistically integrates SAM and CLIP, establishing a dual-enhancement mechanism: global class-level guidance and local pixel-level prompting. We further introduce a learnable vision–text target prompt generator that jointly models support and query sets to improve cross-class semantic consistency. Leveraging CLIP’s language–image alignment capability, our framework provides semantic supplementation at the image level and precise pixel-level guidance, yielding discriminative, task-aware prompt embeddings. Extensive experiments on PASCAL-5^i and COCO-20^i benchmarks demonstrate substantial improvements over state-of-the-art methods, establishing new SOTA performance and validating strong generalization and robustness.

📝 Abstract
Few-shot segmentation (FSS) has garnered significant attention. Many recent approaches introduce the segment anything model (SAM) to handle this task. With its strong generalization ability and rich object-specific feature extraction, SAM shows great potential for FSS. However, SAM's decoding process relies heavily on accurate and explicit prompts, so previous approaches focus mainly on extracting prompts from the support set. This is insufficient to activate SAM's generalization ability, and such a design easily results in a biased decoding process when adapting to unknown classes. In this work, we propose an unbiased semantic decoding (USD) strategy integrated with SAM, which extracts target information from the support and query sets simultaneously to perform consistent predictions guided by the semantics of the contrastive language-image pretraining (CLIP) model. Specifically, to enhance the unbiased semantic discrimination of SAM, we design two feature enhancement strategies that leverage CLIP's semantic alignment capability to enrich the original SAM features: a global supplement at the image level, which provides a generalized category indication from the support image, and a local guidance at the pixel level, which provides useful target locations from the query image. In addition, to generate target-focused prompt embeddings, a learnable visual-text target prompt generator (VTPG) is proposed that interacts target text embeddings with CLIP visual features. Without requiring retraining of the vision foundation models, the semantically discriminative features draw attention to the target region under the guidance of prompts carrying rich target information. Experiments on both the PASCAL-$5^{i}$ and COCO-$20^{i}$ benchmarks show that our proposed method outperforms existing approaches by a clear margin and achieves new state-of-the-art performance.
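The abstract's dual-enhancement idea can be sketched in plain Python: the local pixel-level guidance scores each query pixel's CLIP feature against the class text embedding, and the global image-level supplement blends a CLIP category cue into a SAM feature. This is a minimal illustrative sketch only; the function names, the list-based features, and the blending weight `alpha` are assumptions for exposition, not the paper's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (as plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pixel_prior(clip_pixel_feats, text_emb):
    """Local pixel-level guidance: per-pixel similarity of query CLIP
    features to the target class text embedding, usable as a location prior."""
    return [cosine(f, text_emb) for f in clip_pixel_feats]

def global_supplement(sam_feat, clip_image_emb, alpha=0.5):
    """Global image-level supplement: blend a CLIP category cue (from the
    support image) into the SAM feature. `alpha` is a hypothetical weight."""
    return [(1 - alpha) * s + alpha * c for s, c in zip(sam_feat, clip_image_emb)]
```

In the paper's terms, a high `pixel_prior` value would flag candidate target pixels in the query image, while `global_supplement` enriches SAM features with class-level semantics; the actual fusion is learned rather than a fixed blend.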
Problem

Research questions and friction points this paper is trying to address.

Addresses biased decoding in few-shot segmentation using SAM
Enhances semantic discrimination by integrating CLIP with SAM
Generates target-focused prompts without retraining foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unbiased Semantic Decoding strategy with SAM integration
Feature enhancement using CLIP semantic alignment
Visual-text target prompt generator for embeddings
Jin Wang
School of Control Science and Engineering, China University of Petroleum (East China), Qingdao, Shandong 266580, China
Bingfeng Zhang
School of Control Science and Engineering, China University of Petroleum (East China), Qingdao, Shandong 266580, China
Jian Pang
School of Control Science and Engineering, China University of Petroleum (East China), Qingdao, Shandong 266580, China
Weifeng Liu
University of Florida
Machine Learning · Signal Processing · Kernel adaptive filtering
Baodi Liu
School of Control Science and Engineering, China University of Petroleum (East China), Qingdao, Shandong 266580, China
Honglong Chen
School of Control Science and Engineering, China University of Petroleum (East China), Qingdao, Shandong 266580, China