🤖 AI Summary
This work addresses the underutilization of pre-trained open-source vision-language models (VLMs) in semi-supervised few-shot learning (SSFSL). We identify a fundamental limitation: over-smoothed (flat) softmax outputs from VLMs lead to low-confidence, unreliable pseudo-labels, undermining SSFSL performance. To resolve this, we propose SWIFT (Stage-Wise Finetuning with Temperature Tuning), a framework that enables existing semi-supervised learning methods to finetune VLMs effectively through task-aware classifier initialization and temperature tuning. SWIFT jointly leverages three complementary data sources: task-relevant but noisy samples retrieved from the VLM's pretraining data, abundant unlabeled images, and a minimal set of labeled examples. Evaluated on five SSFSL benchmarks, SWIFT outperforms prior few-shot and semi-supervised methods by roughly 5 accuracy points on average, approaching the performance of fully supervised fine-tuning.
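To make the classifier-initialization idea concrete, the sketch below (our own minimal illustration, not the paper's implementation) initializes the linear classification head from the VLM's text embeddings of the class names, so finetuning starts from the zero-shot classifier rather than random weights. It assumes the `open_clip` library; the model checkpoint, prompt template, and class list are placeholders.

```python
import torch
import torch.nn as nn
import open_clip

# Load an open-source VLM (placeholder checkpoint; any CLIP-style model works).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

class_names = ["cat", "dog", "car"]                  # placeholder task classes
prompts = [f"a photo of a {c}" for c in class_names]  # assumed prompt template

with torch.no_grad():
    text_feats = model.encode_text(tokenizer(prompts))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# Linear head over image features, initialized with the class text embeddings
# ("task-aware" initialization) instead of random weights.
head = nn.Linear(text_feats.shape[1], len(class_names), bias=False)
head.weight.data.copy_(text_feats)
```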
📝 Abstract
Semi-supervised few-shot learning (SSFSL) formulates real-world applications like "auto-annotation": it aims to learn a model from a few labeled and abundant unlabeled examples in order to annotate the unlabeled ones. Despite the availability of powerful open-source Vision-Language Models (VLMs) and their pretraining data, the SSFSL literature largely neglects these open-source resources. In contrast, the related area of few-shot learning (FSL) has already exploited them to boost performance. Arguably, to achieve auto-annotation in the real world, SSFSL should leverage such open-source resources. To this end, we start by applying established SSL methods to finetune a VLM. Counterintuitively, they significantly underperform FSL baselines. Our in-depth analysis reveals the root cause: VLMs produce rather "flat" softmax probability distributions, which results in zero utilization of unlabeled data and weak supervision signals. We address this issue with embarrassingly simple techniques: classifier initialization and temperature tuning. Together they increase the confidence scores of pseudo-labels, improving the utilization rate of unlabeled data and strengthening supervision signals. Building on this, we propose Stage-Wise Finetuning with Temperature Tuning (SWIFT), which enables existing SSL methods to effectively finetune a VLM on limited labeled data, abundant unlabeled data, and task-relevant but noisy data retrieved from the VLM's pretraining set. Extensive experiments on five SSFSL benchmarks show that SWIFT outperforms recent FSL and SSL methods by $\sim$5 accuracy points. SWIFT even rivals supervised learning, which finetunes VLMs with ground-truth labels for the unlabeled data!
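To illustrate the "flat softmax" problem and how temperature tuning addresses it, the toy example below (our own sketch; the logits, temperature values, and 0.95 threshold are illustrative assumptions in the style of FixMatch-like SSL methods) shows how dividing the logits by a smaller temperature sharpens the distribution, letting pseudo-labels clear a confidence threshold and thus be used as supervision.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a lower temperature yields a sharper distribution."""
    z = logits / temperature
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Example logits a VLM might assign an unlabeled image over 10 classes:
# closely spaced values produce a nearly flat softmax at temperature 1.0.
logits = np.array([2.1, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4, 1.3, 1.2, 1.1])

threshold = 0.95               # assumed FixMatch-style confidence threshold

for t in [1.0, 0.1, 0.02]:     # assumed temperatures, for illustration only
    probs = softmax(logits, temperature=t)
    conf = probs.max()
    used = conf >= threshold   # pseudo-label kept only if confident enough
    print(f"T={t:<5} max prob={conf:.3f}  pseudo-label used: {used}")
```

Running this prints a max probability of about 0.15 at T=1.0 (the pseudo-label is discarded) but above 0.99 at T=0.02 (the pseudo-label is used), which mirrors how sharpening confidence raises the utilization rate of unlabeled data.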