🤖 AI Summary
Existing few-shot image classification research predominantly focuses on dual-encoder models (e.g., CLIP), leaving generative-contrastive joint-architecture foundation models—such as CoCa—largely unexplored for efficient adaptation.
Method: We propose a comprehensive adaptation spectrum for CoCa, spanning training-free prototype construction, systematic evaluation of LoRA fine-tuning, a SupCon-enhanced hybrid loss, hybrid prototype initialization, and data-augmentation sensitivity analysis.
Contribution/Results: We identify and name the “augmentation divergence” phenomenon: aggressive data augmentation degrades linear probing in low-shot regimes, yet is essential for stabilizing LoRA fine-tuning. We establish scalable adaptation principles linking regularization strength, LoRA rank, and sampling strategy. Experiments demonstrate that LoRA+SupCon consistently outperforms CLIP baselines across 1–16-shot settings, significantly improving generalization. This work provides the first empirically grounded PEFT configuration guidelines specifically for generative-contrastive multimodal foundation models.
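To make the LoRA+SupCon recipe concrete, here is a minimal NumPy sketch of its two ingredients: a low-rank adapter applied on top of a frozen linear projection, and a hybrid objective combining cross-entropy with the supervised contrastive loss of Khosla et al. (2020). The rank, `alpha` scaling, temperature, and mixing weight `lam` are illustrative placeholder values, not the paper's reported configuration.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha/rank) * B @ A."""
    def __init__(self, W, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                                # frozen, (d_out, d_in)
        self.A = rng.normal(scale=0.01, size=(rank, W.shape[1]))  # trainable down-projection
        self.B = np.zeros((W.shape[0], rank))                     # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        # B starts at zero, so the adapted layer initially matches the frozen backbone
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

def supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss over one batch of (n, d) features."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    off_diag = 1.0 - np.eye(len(labels))            # exclude self-pairs
    sim = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    log_prob = sim - np.log((np.exp(sim) * off_diag).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) * off_diag
    counts = pos.sum(axis=1)
    valid = counts > 0                              # keep anchors with >= 1 positive
    return -((pos * log_prob).sum(axis=1)[valid] / counts[valid]).mean()

def hybrid_loss(logits, features, labels, lam=0.5):
    """Cross-entropy on classifier logits plus lam-weighted SupCon on features."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_softmax[np.arange(len(labels)), labels].mean()
    return ce + lam * supcon_loss(features, labels)
```

In this sketch only `A` and `B` would receive gradients, which is what keeps the adaptation parameter-efficient; the SupCon term pulls same-class embeddings together in the backbone's latent space while cross-entropy shapes the classifier head.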
📝 Abstract
Large-scale multimodal foundation models, particularly Contrastive Captioners (CoCa), have achieved state-of-the-art results by unifying contrastive alignment with generative captioning. While zero-shot transfer capabilities are well-documented, the adaptation of these generative-contrastive hybrids to downstream tasks with extreme data scarcity (few-shot learning) remains under-explored. Existing literature predominantly focuses on dual-encoder architectures like CLIP, leaving a gap in understanding how CoCa's distinct latent space responds to parameter-efficient fine-tuning (PEFT). This paper presents a comprehensive empirical study on adapting the CoCa visual backbone for few-shot image classification. We systematically evaluate a hierarchy of strategies, ranging from training-free hybrid prototyping to deep parameter adaptation via Low-Rank Adaptation (LoRA). First, we identify an "augmentation divergence": while strong data augmentation degrades the performance of linear probing in low-shot settings, it is essential for stabilizing LoRA fine-tuning. We also demonstrate that hybrid objectives incorporating Supervised Contrastive (SupCon) loss yield consistent performance improvements over standard Cross-Entropy across varying shot counts. Crucially, we characterize the sensitivity of training configurations to data scarcity, providing empirical reference settings for scaling regularization, rank, and sampling strategies to facilitate the efficient adaptation of generative-contrastive foundation models.
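To illustrate the training-free end of the spectrum described above, the sketch below builds hybrid prototypes by blending per-class mean image embeddings with class-name text embeddings, then classifies queries by cosine similarity to the nearest prototype. The blending weight `alpha` and all function names are illustrative assumptions, not the paper's interface or reported settings.

```python
import numpy as np

def _normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def hybrid_prototypes(image_feats, labels, text_feats, alpha=0.5):
    """Blend per-class mean image embeddings with text embeddings.

    image_feats: (n_support, d) few-shot support embeddings
    labels:      (n_support,) integer class ids in [0, n_classes)
    text_feats:  (n_classes, d) class-name/prompt embeddings
    alpha:       visual-vs-text mixing weight (hypothetical default)
    """
    z = _normalize(image_feats)
    n_classes = text_feats.shape[0]
    visual = np.stack([z[labels == c].mean(axis=0) for c in range(n_classes)])
    protos = alpha * _normalize(visual) + (1 - alpha) * _normalize(text_feats)
    return _normalize(protos)

def classify(query_feats, prototypes):
    """Nearest-prototype assignment by cosine similarity."""
    return (_normalize(query_feats) @ prototypes.T).argmax(axis=1)
```

Because no parameters are trained, this baseline is immune to the overfitting and augmentation sensitivity that affect gradient-based adaptation, which is what makes it a natural reference point for the LoRA variants.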