CVPT: Cross-Attention Helps Visual Prompt Tuning Adapt to Visual Tasks

Date: 2024-08-27
Venue: arXiv.org
Citations: 2 (Influential: 0)

AI Summary
Visual Prompt Tuning (VPT) suffers from limited downstream adaptability and efficiency due to its disruption of the model’s intrinsic self-attention mechanism. To address this, we propose Cross-Attention Visual Prompt Tuning (CVPT), the first VPT framework that explicitly models semantic interactions between visual prompt tokens and image embedding tokens via cross-attention. Furthermore, we introduce a weight-sharing initialization strategy that enhances representational capacity while maintaining extreme parameter efficiency (<0.1% additional parameters). On VTAB-1K, CVPT achieves a 4% average accuracy gain over standard VPT and matches or exceeds the performance and efficiency of leading adapter-based PEFT methods. Extensive experiments across 25 diverse datasets demonstrate CVPT’s strong generalization capability and practical utility.

πŸ“ Abstract
In recent years, the rapid expansion of model sizes has led to large-scale pre-trained models demonstrating remarkable capabilities, and there has been a corresponding trend toward ever larger models. However, this trend introduces significant challenges, including the substantial computational cost of training and of transfer to downstream tasks. To address these issues, Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced. These methods adapt large-scale pre-trained models to specific tasks by fine-tuning only a small set of parameters. Among PEFT methods, adapter-based and prompt-based approaches are the primary techniques. In visual fine-tuning specifically, adapters have gained prominence over prompts because of the latter's relatively weaker performance and efficiency. Under these circumstances, we refine the widely used Visual Prompt Tuning (VPT) method and propose Cross Visual Prompt Tuning (CVPT). CVPT computes cross-attention between the prompt tokens and the embedded tokens, which allows us to model the semantic relationship between them and fine-tune the model to better adapt to visual tasks. Furthermore, we introduce a weight-sharing mechanism to initialize the parameters of the cross-attention, which avoids introducing massive learnable parameters and enhances the representational capability of the cross-attention. We conduct comprehensive testing across 25 datasets, and the results indicate that CVPT significantly improves VPT's performance and efficiency on visual tasks. For example, on the VTAB-1K benchmark, CVPT outperforms VPT by over 4% in average accuracy, rivaling advanced adapter-based methods in performance and efficiency. Our experiments confirm that prompt-based methods can achieve exceptional results in visual fine-tuning.
Problem

Research questions and friction points this paper is trying to address.

Addresses Visual Prompt Tuning's performance limitations
Preserves the integrity of self-attention in vision models
Improves efficiency without large parameter overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces cross-attention module for prompt-image interaction
Decouples prompts to preserve self-attention integrity
Uses weight-sharing for efficient cross-attention initialization
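The core idea described above can be illustrated with a minimal sketch: prompt tokens act as queries in a cross-attention over the image embedding tokens, and the cross-attention projections are initialized by sharing the pretrained self-attention weights rather than adding fresh learnable matrices. This is a simplified NumPy illustration of the mechanism, not the authors' implementation; all dimensions, variable names, and the single-head formulation are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_img, n_prompt = 16, 8, 4  # illustrative sizes

# Stand-ins for the pretrained self-attention projection weights
W_q = rng.standard_normal((d, d)) / np.sqrt(d)
W_k = rng.standard_normal((d, d)) / np.sqrt(d)
W_v = rng.standard_normal((d, d)) / np.sqrt(d)

# Weight-sharing initialization: the cross-attention module reuses the
# self-attention projections instead of introducing new parameters
Wc_q, Wc_k, Wc_v = W_q, W_k, W_v

X = rng.standard_normal((n_img, d))     # image embedding tokens
P = rng.standard_normal((n_prompt, d))  # learnable visual prompt tokens

# Cross-attention: prompts query the image tokens (keys/values), so the
# image tokens' own self-attention path is left undisturbed
Q = P @ Wc_q
K = X @ Wc_k
V = X @ Wc_v
attn = softmax(Q @ K.T / np.sqrt(d))  # (n_prompt, n_img)
P_out = attn @ V                      # updated prompt tokens

print(P_out.shape)  # (4, 16)
```

Because the prompts only read from the image tokens through this separate cross-attention path, the backbone's self-attention over image tokens is computed exactly as in the pretrained model, which is the decoupling the bullets above refer to.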
Lingyun Huang
College of Electrical and Information Engineering, Hunan University, Changsha, 410082, China
Jianxu Mao
College of Electrical and Information Engineering, Hunan University, Changsha, 410082, China
Yaonan Wang
College of Electrical and Information Engineering, Hunan University, Changsha, 410082, China
Junfei Yi
College of Electrical and Information Engineering, Hunan University, Changsha, 410082, China
Ziming Tao
College of Electrical and Information Engineering, Hunan University, Changsha, 410082, China