A Survey on Training-free Open-Vocabulary Semantic Segmentation

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open-vocabulary semantic segmentation suffers from high annotation costs and the need for task-specific model retraining. This paper systematically surveys zero-shot, fine-tuning-free approaches that leverage pre-trained multimodal models—particularly CLIP—to achieve pixel-level classification for unseen categories. We propose, for the first time, a unified taxonomy of three paradigmatic frameworks: (i) pure CLIP-driven methods, (ii) vision foundation model–assisted approaches (e.g., SAM, diffusion models), and (iii) generative prompt expansion techniques. A structured comparative analysis is conducted across 30+ state-of-the-art methods. Our study identifies shared limitations in cross-domain generalization, fine-grained spatial localization, and prompt robustness. We further articulate several underexplored research directions. The work provides both a theoretical synthesis and practical guidance for developing low-resource, scalable semantic segmentation systems, substantially lowering deployment barriers.

📝 Abstract
Semantic segmentation is one of the most fundamental tasks in image understanding, with a long history of research and, consequently, a myriad of different approaches. Traditional methods train models from scratch, requiring vast amounts of computational resources and training data. With the shift toward open-vocabulary semantic segmentation, which asks models to classify beyond learned categories, acquiring large quantities of finely annotated data becomes prohibitively expensive. Researchers have instead turned to training-free methods that leverage existing models built for tasks where data is more easily acquired. Specifically, this survey covers the history, nuances, development of ideas, and state of the art in training-free open-vocabulary semantic segmentation that leverages existing multi-modal classification models. We first give a preliminary on the task definition, follow with an overview of popular model archetypes, and then spotlight over 30 approaches split into broader research branches: purely CLIP-based methods, those leveraging auxiliary visual foundation models, and those relying on generative methods. Subsequently, we discuss the limitations and open problems of current research and outline some underexplored ideas for future study. We believe this survey will serve as a good onboarding read for new researchers and spark increased interest in the area.
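The core mechanism behind the CLIP-based methods the survey covers can be summarized as: compare each image patch embedding against text embeddings of the candidate class names, then assign every patch (and, after upsampling, every pixel) to its most similar class. The sketch below illustrates this with random tensors standing in for real encoder outputs; in practice the patch and text embeddings would come from a pre-trained multimodal model such as CLIP, and the grid size and embedding dimension here are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for real encoder outputs: in an actual pipeline these would be
# CLIP image-patch features and text embeddings of class-name prompts.
grid_size, num_classes, dim = 14, 3, 512
patch_emb = F.normalize(torch.randn(grid_size * grid_size, dim), dim=-1)
text_emb = F.normalize(torch.randn(num_classes, dim), dim=-1)

# Cosine similarity between every patch and every class prompt,
# followed by a per-patch argmax over classes.
logits = patch_emb @ text_emb.T          # (num_patches, num_classes)
patch_labels = logits.argmax(dim=-1)     # per-patch class index

# Reshape to a coarse label grid and upsample to pixel resolution.
grid = patch_labels.reshape(grid_size, grid_size)
mask = F.interpolate(grid[None, None].float(), size=(224, 224),
                     mode="nearest").long()[0, 0]
print(mask.shape)
```

Training-free methods in the survey differ mainly in how they sharpen this patch-level similarity map, for example by modifying CLIP's attention, borrowing masks from models like SAM, or using generative models, rather than in the matching step itself.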
Problem

Research questions and friction points this paper is trying to address.

Survey explores training-free open-vocabulary semantic segmentation methods
Leverages existing multi-modal models to avoid costly data annotation
Analyzes CLIP-based, auxiliary visual, and generative approaches for segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes a unified taxonomy of training-free approaches: pure CLIP-based, visual-foundation-model-assisted, and generative methods
Provides a structured comparative analysis of 30+ state-of-the-art methods
Identifies shared limitations and underexplored directions for future research
Naomi Kombol
Sveučilište u Zagrebu, Fakultet elektrotehnike i računarstva
Ivan Martinović
PhD Student, University of Zagreb
Domain Adaptation · Panoptic Segmentation · Open-Vocabulary Segmentation · Semi-Supervised Learning
Siniša Šegvić
Sveučilište u Zagrebu, Fakultet elektrotehnike i računarstva