🤖 AI Summary
Open-vocabulary semantic segmentation faces two key challenges: (1) the domain shift between image and text embeddings, and (2) insufficient modeling of shallow-level semantics, which degrades segmentation accuracy for small objects and fine-grained structures. To address these, we propose DPSeg, a dual-prompt collaborative alignment framework. First, we construct a dual-prompt scheme—comprising class-wise and token-wise prompts—to generate transferable, fine-grained textual representations. Second, we introduce a semantic-guided prompt refinement mechanism that jointly optimizes visual prompt encoding and cross-modal alignment while enabling multi-level feature fusion. Our method integrates CLIP's textual priors, visual prompt encoding, semantic-guided decoding, and iterative prompt optimization. Evaluated on Pascal-Context and COCO-Stuff, it achieves new state-of-the-art performance: +4.2% mIoU on unseen categories, with notably improved segmentation accuracy for small objects.
📝 Abstract
Open-vocabulary semantic segmentation aims to segment images into distinct semantic regions at the pixel level, covering both seen and unseen categories. Current methods utilize text embeddings from pre-trained vision-language models like CLIP but struggle with the inherent domain gap between image and text embeddings, even after extensive alignment during training. Additionally, relying solely on deep text-aligned features limits shallow-level feature guidance, which is crucial for detecting small objects and fine details, ultimately reducing segmentation accuracy. To address these limitations, we propose DPSeg, a dual-prompting framework for this task. Our approach combines dual-prompt cost volume generation, a cost volume-guided decoder, and a semantic-guided prompt refinement strategy that leverages our dual prompting scheme to mitigate alignment issues in visual prompt generation. By incorporating visual embeddings from a visual prompt encoder, our approach reduces the domain gap between text and image embeddings while providing multi-level guidance through shallow features. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches on multiple public datasets.
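To make the cost-volume idea concrete, here is a minimal NumPy sketch of how a cosine-similarity cost volume between patch-level visual features and per-class text embeddings can be formed, and how two prompt streams might be fused. The function names, shapes, and the averaging fusion rule are illustrative assumptions, not the paper's actual architecture (the abstract does not specify the fusion operator or feature dimensions).

```python
import numpy as np

def cost_volume(image_feats, text_embeds):
    """Cosine-similarity cost volume.

    image_feats: (H, W, D) patch-level visual embeddings (e.g. from CLIP's image encoder)
    text_embeds: (C, D) one text embedding per candidate class name
    returns:     (H, W, C) per-pixel, per-class similarity scores
    """
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    # Dot product of unit vectors = cosine similarity.
    return np.einsum("hwd,cd->hwc", img, txt)

# Dual prompts: a class-wise and a token-wise text embedding each yield a
# cost volume. Averaging the two volumes is an assumption made here for
# illustration only; the paper's decoder consumes them differently.
H, W, D, C = 4, 4, 8, 3
rng = np.random.default_rng(0)
feats = rng.standard_normal((H, W, D))
class_prompts = rng.standard_normal((C, D))   # hypothetical class-wise prompts
token_prompts = rng.standard_normal((C, D))   # hypothetical token-wise prompts

fused = 0.5 * (cost_volume(feats, class_prompts)
               + cost_volume(feats, token_prompts))
pred = fused.argmax(axis=-1)  # coarse per-pixel class map, shape (H, W)
```

In the full method this volume would guide a decoder that also receives shallow features, rather than being argmax-ed directly.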