Solving Semi-Supervised Few-Shot Learning from an Auto-Annotation Perspective

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the underutilization of pre-trained open-source vision-language models (VLMs) in semi-supervised few-shot learning (SSFSL). It identifies a fundamental limitation: over-smoothed ("flat") softmax outputs from VLMs lead to low-confidence, unreliable pseudo-labels, undermining SSFSL performance. To resolve this, the authors propose SWIFT (Stage-Wise Finetuning with Temperature Tuning), which combines a task-aware classifier initialization with temperature tuning to sharpen pseudo-label confidence. SWIFT jointly leverages three complementary data sources: task-relevant but noisy samples retrieved from the VLM's pretraining set, abundant unlabeled images, and a minimal set of labeled examples. Evaluated on five SSFSL benchmarks, SWIFT achieves an average gain of roughly 5 accuracy points over prior methods, approaching fully supervised fine-tuning and substantially outperforming existing few-shot and semi-supervised approaches.

📝 Abstract
Semi-supervised few-shot learning (SSFSL) formulates real-world applications like "auto-annotation": it aims to learn a model over a few labeled and abundant unlabeled examples in order to annotate the unlabeled ones. Despite the availability of powerful open-source Vision-Language Models (VLMs) and their pretraining data, the SSFSL literature largely neglects these open-source resources. In contrast, the related area of few-shot learning (FSL) has already exploited them to boost performance. Arguably, to achieve auto-annotation in the real world, SSFSL should leverage such open-source resources. To this end, we start by applying established SSL methods to finetune a VLM. Counterintuitively, they significantly underperform FSL baselines. Our in-depth analysis reveals the root cause: VLMs produce rather "flat" distributions of softmax probabilities, which results in zero utilization of unlabeled data and weak supervision signals. We address this issue with embarrassingly simple techniques: classifier initialization and temperature tuning. Together they increase the confidence scores of pseudo-labels, improving the utilization rate of unlabeled data and strengthening supervision signals. Building on this, we propose Stage-Wise Finetuning with Temperature Tuning (SWIFT), which enables existing SSL methods to effectively finetune a VLM on limited labeled data, abundant unlabeled data, and task-relevant but noisy data retrieved from the VLM's pretraining set. Extensive experiments on five SSFSL benchmarks show that SWIFT outperforms recent FSL and SSL methods by ~5 accuracy points. SWIFT even rivals supervised learning, which finetunes VLMs with the unlabeled data labeled with ground truth!
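The "flat softmax" failure mode described in the abstract can be illustrated with a minimal sketch: when logits are nearly uniform, the top-class probability falls below a FixMatch-style confidence threshold and the pseudo-label is discarded; dividing logits by a small temperature sharpens the distribution and pushes the same prediction above the threshold. The logit values, the temperature, and the 0.9 threshold below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Scale logits by 1/temperature before normalizing; a smaller
    # temperature sharpens the distribution, a larger one flattens it.
    z = logits / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy logits for one unlabeled image over 5 classes (illustrative values,
# mimicking the narrow logit gaps a VLM can produce).
logits = np.array([2.1, 1.8, 1.6, 1.5, 1.4])

flat = softmax(logits, temperature=1.0)   # VLM-style "flat" output
sharp = softmax(logits, temperature=0.1)  # after temperature tuning

threshold = 0.9  # hypothetical FixMatch-style confidence threshold
print(flat.max() >= threshold)   # False: pseudo-label would be discarded
print(sharp.max() >= threshold)  # True: pseudo-label would be used
```

Note that temperature scaling does not change the argmax class; it only changes the confidence attached to it, which is exactly what a threshold-based pseudo-labeling rule is sensitive to.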
Problem

Research questions and friction points this paper is trying to address.

Enhance semi-supervised few-shot learning for auto-annotation tasks
Address underutilization of unlabeled data in vision-language model fine-tuning
Improve pseudo-label confidence via classifier initialization and temperature tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Finetune Vision-Language Models with classifier initialization
Apply temperature tuning to boost pseudo-label confidence
Stage-wise finetuning jointly leverages labeled, unlabeled, and noisy data retrieved from the VLM's pretraining set
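The classifier-initialization idea can be sketched as follows: instead of starting from a random weight matrix, the linear classifier's weights are initialized from the VLM's text embeddings of the class names (CLIP-style zero-shot classifiers are built this way), so finetuning starts from the zero-shot decision boundary. The random vectors below are a hypothetical stand-in for real text-encoder outputs, which this sketch does not compute:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 5, 512

# Stand-in for text embeddings of prompts like "a photo of a {class}";
# in practice these would come from the VLM's text encoder.
text_embeddings = rng.normal(size=(num_classes, dim))
text_embeddings /= np.linalg.norm(text_embeddings, axis=1, keepdims=True)

# Task-aware initialization: the classifier weight matrix starts from the
# normalized text embeddings rather than a random draw.
classifier_weights = text_embeddings.copy()

# A unit-normalized image embedding (also a random stand-in here).
image_embedding = rng.normal(size=dim)
image_embedding /= np.linalg.norm(image_embedding)

# Logits are cosine similarities between image and class embeddings,
# matching the VLM's zero-shot scoring rule at initialization.
logits = classifier_weights @ image_embedding
print(logits.shape)  # (5,)
```

Because both weight rows and the image embedding are unit-normalized, the initial logits equal the zero-shot cosine similarities; finetuning then adapts the weights away from this starting point using the pseudo-labeled data.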