A Survey on Training-free Open-Vocabulary Semantic Segmentation

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open-vocabulary semantic segmentation suffers from high annotation costs and the need for task-specific model retraining. This paper systematically surveys zero-shot, fine-tuning-free approaches that leverage pre-trained multimodal models—particularly CLIP—to achieve pixel-level classification for unseen categories. We propose, for the first time, a unified taxonomy of three paradigmatic frameworks: (i) pure CLIP-driven methods, (ii) vision foundation model–assisted approaches (e.g., SAM, diffusion models), and (iii) generative prompt expansion techniques. A structured comparative analysis is conducted across 30+ state-of-the-art methods. Our study identifies shared limitations in cross-domain generalization, fine-grained spatial localization, and prompt robustness. We further articulate several underexplored research directions. The work provides both a theoretical synthesis and practical guidance for developing low-resource, scalable semantic segmentation systems, substantially lowering deployment barriers.

📝 Abstract
Semantic segmentation is one of the most fundamental tasks in image understanding, with a long history of research and, consequently, a myriad of different approaches. Traditional methods train models from scratch, requiring vast amounts of computational resources and training data. With the shift toward open-vocabulary semantic segmentation, which asks models to classify beyond learned categories, acquiring large quantities of finely annotated data becomes prohibitively expensive. Researchers have instead turned to training-free methods that leverage existing models built for tasks where data is more easily acquired. Specifically, this survey covers the history, nuances, development of ideas, and state of the art in training-free open-vocabulary semantic segmentation that leverages existing multi-modal classification models. We first give a preliminary on the task definition, follow with an overview of popular model archetypes, and then spotlight over 30 approaches split into broader research branches: purely CLIP-based methods, those leveraging auxiliary visual foundation models, and those relying on generative methods. Subsequently, we discuss the limitations and open problems of current research and outline some underexplored ideas for future study. We believe this survey will serve as a good onboarding read for new researchers and spark increased interest in the area.
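The core mechanism behind the CLIP-based methods the survey covers can be summarized as: compare each image patch embedding against text embeddings of the candidate class names, then assign every patch (and, after upsampling, every pixel) to its most similar class. The sketch below illustrates this with random tensors standing in for real encoder outputs; in practice the patch and text embeddings would come from a pre-trained multimodal model such as CLIP, and the grid size and embedding dimension here are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for real encoder outputs: in an actual pipeline these would be
# CLIP image-patch features and text embeddings of class-name prompts.
grid_size, num_classes, dim = 14, 3, 512
patch_emb = F.normalize(torch.randn(grid_size * grid_size, dim), dim=-1)
text_emb = F.normalize(torch.randn(num_classes, dim), dim=-1)

# Cosine similarity between every patch and every class prompt,
# followed by a per-patch argmax over classes.
logits = patch_emb @ text_emb.T          # (num_patches, num_classes)
patch_labels = logits.argmax(dim=-1)     # per-patch class index

# Reshape to a coarse label grid and upsample to pixel resolution.
grid = patch_labels.reshape(grid_size, grid_size)
mask = F.interpolate(grid[None, None].float(), size=(224, 224),
                     mode="nearest").long()[0, 0]
print(mask.shape)
```

Training-free methods in the survey differ mainly in how they sharpen this patch-level similarity map, for example by modifying CLIP's attention, borrowing masks from models like SAM, or using generative models, rather than in the matching step itself.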
Problem

Research questions and friction points this paper is trying to address.

Survey explores training-free open-vocabulary semantic segmentation methods
Leverages existing multi-modal models to avoid costly data annotation
Analyzes CLIP-based, auxiliary visual, and generative approaches for segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes a unified taxonomy of training-free approaches: pure CLIP-based, visual-foundation-model-assisted, and generative methods
Provides a structured comparative analysis of 30+ state-of-the-art methods
Identifies shared limitations and underexplored directions for future research
Naomi Kombol
Sveučilište u Zagrebu, Fakultet elektrotehnike i računarstva
Ivan Martinović
PhD Student, University of Zagreb
Domain Adaptation · Panoptic Segmentation · Open-Vocabulary Segmentation · Semi-Supervised Learning
Siniša Šegvić
Sveučilište u Zagrebu, Fakultet elektrotehnike i računarstva