Language-Aware Information Maximization for Transductive Few-Shot CLIP

📅 2025-08-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses transductive few-shot learning with vision-language models (VLMs) by proposing a language-aware mutual information maximization framework. Methodologically, it is the first to bring information-theoretic principles to transductive CLIP learning: it jointly optimizes (a) visual-text mutual information, (b) consistency between the network's outputs and text-driven zero-shot predictions via a KL-divergence regularizer, and (c) a cross-entropy loss on the labeled examples, all within an end-to-end pipeline enabled by parameter-efficient fine-tuning (PEFT). Key contributions are: (1) a language-guided mutual information objective that explicitly enforces cross-modal semantic alignment; and (2) empirical and analytical insights into the critical role of PEFT in transductive few-shot VLM adaptation. The framework achieves state-of-the-art performance across multiple benchmarks, significantly outperforming existing transductive and inductive approaches, demonstrating its effectiveness and strong generalization capability.

📝 Abstract
Transductive few-shot learning has triggered an abundant literature focusing on vision-only models, but is still at a nascent stage within the recent context of foundational vision-language models (VLMs). Only a few recent methods addressed the problem, pointing to the potential of transduction in VLMs and to the need for VLM-tailored methods. Building on this momentum, we leverage information-theoretic concepts and recent progress in parameter-efficient fine-tuning (PEFT), developing a highly competitive transductive few-shot CLIP method. Specifically, we introduce a novel Language-aware Information MaximizatiOn (LIMO) loss integrating three complementary terms: (i) the mutual information between the vision inputs and the textual class descriptions; (ii) a Kullback-Leibler (KL) divergence penalizing deviation of the network's probabilistic outputs from the text-driven zero-shot predictions; and (iii) a standard cross-entropy loss based on the labeled shots. Furthermore, we challenge the commonly followed fine-tuning practices in the context of transductive few-shot learning, and explore PEFT strategies, completely overlooked in this context. Surprisingly, we observe substantial boosts in performance, which points to the potential of adapting a subset of the model's parameters in the transductive few-shot setting. We report comprehensive evaluations, which show that LIMO outperforms the very recent transductive few-shot CLIP methods by a large margin and yields significant gains over the best-performing inductive methods. Our code is publicly available at: https://github.com/ghassenbaklouti/LIMO
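The three-term objective described in the abstract can be sketched as follows. This is an illustrative numpy implementation, not the paper's exact formulation: the mutual-information term is written in its common surrogate form (marginal entropy minus conditional entropy of the predictions over the query set), and the weights `lam` and `gamma` are hypothetical balancing hyperparameters.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def limo_style_loss(logits, zs_logits, labeled_idx, labels, lam=1.0, gamma=1.0):
    """Sketch of a three-term transductive objective (illustrative, not LIMO's exact form).

    logits:      (N, K) image-text similarity logits from the adapted model
    zs_logits:   (N, K) frozen zero-shot CLIP logits (text-driven predictions)
    labeled_idx: indices of the few labeled shots within the N samples
    labels:      ground-truth class indices for those shots
    """
    eps = 1e-12
    p = softmax(logits)      # model predictions over labeled + query samples
    q = softmax(zs_logits)   # zero-shot predictions used as a language prior

    # (i) mutual-information surrogate: high marginal entropy (use all classes)
    # minus low conditional entropy (confident per-sample predictions)
    marg = p.mean(axis=0)
    h_marg = -(marg * np.log(marg + eps)).sum()
    h_cond = -(p * np.log(p + eps)).sum(axis=1).mean()
    mi = h_marg - h_cond

    # (ii) KL(p || q): penalize deviation from the text-driven zero-shot prior
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=1).mean()

    # (iii) cross-entropy on the labeled shots only
    ce = -np.log(p[labeled_idx, labels] + eps).mean()

    # maximize MI, so it enters with a negative sign
    return -mi + lam * kl + gamma * ce
```

In a full pipeline this scalar would be minimized with respect to the PEFT parameters only, with `zs_logits` held fixed throughout.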
Problem

Research questions and friction points this paper is trying to address.

Improving transductive few-shot learning for vision-language models
Enhancing CLIP performance with information-theoretic loss functions
Exploring parameter-efficient fine-tuning strategies in few-shot settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-aware Information MaximizatiOn loss
Parameter-efficient fine-tuning strategies
Mutual information and KL divergence integration
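The PEFT idea highlighted above amounts to marking only a small, named subset of the model's parameters as trainable and freezing the rest. The sketch below illustrates this selection step generically; the name patterns (`bias`, `ln_`) are assumptions for illustration and are not claimed to be the subset LIMO actually adapts.

```python
# Illustrative parameter-subset selection for PEFT-style adaptation.
# Given a mapping of parameter names to their sizes, keep only those whose
# names match the chosen patterns; everything else stays frozen.
def select_peft_params(named_params, patterns=("bias", "ln_")):
    return {name: size for name, size in named_params.items()
            if any(pat in name for pat in patterns)}

def trainable_fraction(named_params, patterns=("bias", "ln_")):
    total = sum(named_params.values())
    kept = sum(select_peft_params(named_params, patterns).values())
    return kept / total
```

With a real CLIP backbone, the same filter would be applied to the model's named parameters, setting `requires_grad` only on the selected subset before optimizing the transductive loss.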