DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation

📅 2025-05-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Open-vocabulary semantic segmentation faces two key challenges: (1) domain shift between image and text embeddings, and (2) insufficient modeling of shallow-level semantics, leading to poor segmentation accuracy for small objects and fine-grained structures. To address these, we propose a Dual-Prompt Collaborative Alignment framework. First, we construct a dual-prompt ontology—comprising class-wise and token-wise prompts—to generate transferable, fine-grained textual representations. Second, we introduce a semantics-guided prompt refinement mechanism that jointly optimizes visual prompt encoding and cross-modal alignment while enabling multi-level feature fusion. Our method integrates CLIP’s textual priors, visual prompt encoding, ontology-based modeling, semantics-guided decoding, and iterative prompt optimization. Evaluated on Pascal-Context and COCO-Stuff, it achieves new state-of-the-art performance: +4.2% mIoU on unseen categories and notably improved segmentation accuracy for small objects.

Technology Category

Application Category

📝 Abstract
Open-vocabulary semantic segmentation aims to segment images into distinct semantic regions for both seen and unseen categories at the pixel level. Current methods utilize text embeddings from pre-trained vision-language models like CLIP but struggle with the inherent domain gap between image and text embeddings, even after extensive alignment during training. Additionally, relying solely on deep text-aligned features limits shallow-level feature guidance, which is crucial for detecting small objects and fine details, ultimately reducing segmentation accuracy. To address these limitations, we propose a dual prompting framework, DPSeg, for this task. Our approach combines dual-prompt cost volume generation, a cost volume-guided decoder, and a semantic-guided prompt refinement strategy that leverages our dual prompting scheme to mitigate alignment issues in visual prompt generation. By incorporating visual embeddings from a visual prompt encoder, our approach reduces the domain gap between text and image embeddings while providing multi-level guidance through shallow features. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches on multiple public datasets.
Problem

Research questions and friction points this paper is trying to address.

Bridging domain gap between image and text embeddings
Enhancing small object detection with shallow features
Improving segmentation accuracy for unseen categories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-prompt cost volume generation for alignment
Cost volume-guided decoder for segmentation
Semantic-guided prompt refinement strategy
🔎 Similar Papers
No similar papers found.