🤖 AI Summary
To address the reliance of vision-language models (e.g., CLIP) on computationally expensive offline fine-tuning and their susceptibility to overfitting in few-shot image recognition, this paper proposes a parameter-free online few-shot learning framework. The method introduces two novel attention-based adapters: (1) a Memory Attn-Adapter that dynamically refines class-specific text embeddings via an external memory module; and (2) a Local-Global Attn-Adapter that jointly integrates local visual details and global semantics from support samples to enhance image feature representations. Crucially, the framework requires no parameter updates to either the vision or the language encoder, enabling efficient, plug-and-play adaptation. Extensive experiments demonstrate state-of-the-art performance on cross-category and cross-dataset few-shot benchmarks, with significant gains over prior methods. Moreover, the framework achieves high inference efficiency and remains compatible with diverse CLIP backbones, establishing a lightweight, robust paradigm for few-shot adaptation of vision-language models.
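To make the Memory Attn-Adapter idea concrete, the refinement of a class text embedding by attending over that class's support image features can be sketched as parameter-free scaled dot-product cross-attention. This is an illustrative reading of the summary, not the paper's exact formulation: the mixing weight `alpha` and the blend step are hypothetical details introduced here.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def refine_class_embedding(text_emb, support_feats, alpha=0.5):
    """Parameter-free cross-attention sketch: the class text embedding acts
    as the query, the class's support image features act as keys/values, and
    the attended support context is blended back into the text embedding.
    `alpha` is a hypothetical mixing weight, not taken from the paper."""
    d = len(text_emb)
    # Scaled dot-product attention scores between query and each support feature.
    scores = [sum(t * f_i for t, f_i in zip(text_emb, f)) / math.sqrt(d)
              for f in support_feats]
    weights = softmax(scores)
    # Attention-weighted sum of support features (the "memory" context).
    context = [sum(w * f[i] for w, f in zip(weights, support_feats))
               for i in range(d)]
    # Blend the original text embedding with the support-derived context.
    return [(1 - alpha) * t + alpha * c for t, c in zip(text_emb, context)]
```

Because no projection matrices are learned, this refinement needs no gradient updates, which is consistent with the summary's claim that neither encoder is fine-tuned.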
📝 Abstract
Contrastive vision-language models excel at zero-shot image recognition but struggle in few-shot scenarios: offline fine-tuning via prompt learning is computationally intensive and risks overfitting. To overcome these limitations, we propose Attn-Adapter, a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. This architecture enables dynamic adaptation from a few labeled samples without retraining the base model. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization while maintaining efficient inference and scaling across CLIP backbones.
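The Local-Global Attn-Adapter described above can likewise be sketched as the global image feature attending over local patch features, followed by a cosine-similarity match against the (refined) category embeddings. This is a minimal illustration under stated assumptions: the blend weight `alpha`, the fold-back step, and the helper names are introduced here for exposition and are not the paper's implementation.

```python
import math

def _l2norm(v):
    # Unit-normalize a vector (guarding against the zero vector).
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def enrich_image_embedding(global_feat, local_feats, alpha=0.5):
    """Sketch of local-global attention: the global feature (e.g. the CLS
    token) queries the local patch features, and the attended local detail
    is folded back into the global embedding. `alpha` is hypothetical."""
    d = len(global_feat)
    scores = [sum(g * l for g, l in zip(global_feat, f)) / math.sqrt(d)
              for f in local_feats]
    m = max(scores)
    es = [math.exp(s - m) for s in scores]
    z = sum(es)
    weights = [e / z for e in es]
    context = [sum(w * f[i] for w, f in zip(weights, local_feats))
               for i in range(d)]
    return [(1 - alpha) * g + alpha * c for g, c in zip(global_feat, context)]

def classify(image_emb, class_embs):
    """Pick the class whose embedding has the highest cosine similarity
    with the enriched image embedding."""
    q = _l2norm(image_emb)
    sims = [sum(a * b for a, b in zip(q, _l2norm(c))) for c in class_embs]
    return max(range(len(sims)), key=sims.__getitem__)
```

Since both adapters operate purely on frozen encoder outputs, new classes or datasets can be handled online by swapping in new support embeddings, which is the plug-and-play behavior the abstract emphasizes.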