🤖 AI Summary
To address the reliance of vision-language models (e.g., CLIP) on computationally expensive offline fine-tuning and their susceptibility to overfitting in few-shot image recognition, this paper proposes a parameter-free online few-shot learning framework. The method introduces two novel attention-based adapters: (1) a Memory Attn-Adapter that dynamically refines class-specific text embeddings via an external memory module; and (2) a Local-Global Attn-Adapter that jointly integrates local visual details and global semantics from support samples to enhance image feature representations. Crucially, the framework requires no parameter updates to either the vision or the language encoder, enabling efficient, plug-and-play adaptation. Extensive experiments demonstrate state-of-the-art performance on cross-category and cross-dataset few-shot benchmarks, with significant gains over prior methods. Moreover, the framework achieves high inference efficiency and remains compatible with diverse CLIP backbones, establishing a lightweight, robust paradigm for few-shot adaptation of vision-language models.
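To make the Memory Attn-Adapter idea concrete, the refinement of a class text embedding by attending over that class's support image features can be sketched as parameter-free scaled dot-product cross-attention. This is an illustrative reading of the summary, not the paper's exact formulation: the mixing weight `alpha` and the blend step are hypothetical details introduced here.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def refine_class_embedding(text_emb, support_feats, alpha=0.5):
    """Parameter-free cross-attention sketch: the class text embedding acts
    as the query, the class's support image features act as keys/values, and
    the attended support context is blended back into the text embedding.
    `alpha` is a hypothetical mixing weight, not taken from the paper."""
    d = len(text_emb)
    # Scaled dot-product attention scores between query and each support feature.
    scores = [sum(t * f_i for t, f_i in zip(text_emb, f)) / math.sqrt(d)
              for f in support_feats]
    weights = softmax(scores)
    # Attention-weighted sum of support features (the "memory" context).
    context = [sum(w * f[i] for w, f in zip(weights, support_feats))
               for i in range(d)]
    # Blend the original text embedding with the support-derived context.
    return [(1 - alpha) * t + alpha * c for t, c in zip(text_emb, context)]
```

Because no projection matrices are learned, this refinement needs no gradient updates, which is consistent with the summary's claim that neither encoder is fine-tuned.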
📝 Abstract
Contrastive vision-language models excel at zero-shot image recognition but struggle in few-shot scenarios: offline fine-tuning via prompt learning is computationally intensive and risks overfitting. To overcome these limitations, we propose Attn-Adapter, a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. This architecture enables dynamic adaptation from a few labeled samples without retraining the base model. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization while maintaining efficient inference and scaling across CLIP backbones.
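The Local-Global Attn-Adapter described above can likewise be sketched as the global image feature attending over local patch features, followed by a cosine-similarity match against the (refined) category embeddings. This is a minimal illustration under stated assumptions: the blend weight `alpha`, the fold-back step, and the helper names are introduced here for exposition and are not the paper's implementation.

```python
import math

def _l2norm(v):
    # Unit-normalize a vector (guarding against the zero vector).
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def enrich_image_embedding(global_feat, local_feats, alpha=0.5):
    """Sketch of local-global attention: the global feature (e.g. the CLS
    token) queries the local patch features, and the attended local detail
    is folded back into the global embedding. `alpha` is hypothetical."""
    d = len(global_feat)
    scores = [sum(g * l for g, l in zip(global_feat, f)) / math.sqrt(d)
              for f in local_feats]
    m = max(scores)
    es = [math.exp(s - m) for s in scores]
    z = sum(es)
    weights = [e / z for e in es]
    context = [sum(w * f[i] for w, f in zip(weights, local_feats))
               for i in range(d)]
    return [(1 - alpha) * g + alpha * c for g, c in zip(global_feat, context)]

def classify(image_emb, class_embs):
    """Pick the class whose embedding has the highest cosine similarity
    with the enriched image embedding."""
    q = _l2norm(image_emb)
    sims = [sum(a * b for a, b in zip(q, _l2norm(c))) for c in class_embs]
    return max(range(len(sims)), key=sims.__getitem__)
```

Since both adapters operate purely on frozen encoder outputs, new classes or datasets can be handled online by swapping in new support embeddings, which is the plug-and-play behavior the abstract emphasizes.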