🤖 AI Summary
Problem: Soft prompt tuning often suffers from catastrophic forgetting of general-purpose knowledge in vision-language models (e.g., CLIP) under few-shot settings, leading to performance worse than zero-shot inference.
Method: We propose Prompt-aligned Gradient (ProGrad), an optimization rule that constrains each prompt update to be aligned with (or at least non-conflicting with) the optimization direction offered by the zero-shot predictions of the pre-defined, hand-crafted prompt. This explicitly preserves the task-agnostic, pre-trained knowledge without extra data, architectural changes, or generic anti-overfitting tricks (a sketch of the update rule follows this summary).
Contribution/Results: ProGrad mitigates overfitting and the forgetting of general knowledge. It consistently outperforms state-of-the-art prompt-tuning methods across diverse transfer scenarios, including few-shot learning, domain generalization, base-to-new generalization, and cross-dataset transfer, demonstrating stronger few-shot generalization ability.
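To make the alignment rule concrete, here is a minimal PyTorch-style sketch. The helper name `prograd_step`, the loss names, the learning rate, and the projection formula are illustrative assumptions under which conflicting gradient components are removed; they are not the authors' exact implementation.

```python
import torch

def prograd_step(prompt, loss_task, loss_general, lr=2e-3):
    """One update of the soft prompt under a gradient-alignment rule (sketch).

    prompt       : leaf tensor with requires_grad=True (the learnable context).
    loss_task    : few-shot cross-entropy loss of the learned prompt.
    loss_general : disagreement with the zero-shot predictions of the
                   hand-crafted prompt (e.g., a KL divergence); its gradient
                   represents the general-knowledge direction.
    """
    # Gradient of the downstream (few-shot) objective w.r.t. the prompt.
    g_task, = torch.autograd.grad(loss_task, prompt, retain_graph=True)
    # Gradient representing the general-knowledge direction.
    g_general, = torch.autograd.grad(loss_general, prompt)

    dot = torch.sum(g_task * g_general)
    if dot >= 0:
        # Non-conflicting: keep the task gradient as-is.
        g = g_task
    else:
        # Conflicting: project out the component that opposes the
        # general-knowledge direction, so the update does not (to first
        # order) increase disagreement with the zero-shot predictions.
        g = g_task - (dot / torch.sum(g_general * g_general)) * g_general

    with torch.no_grad():
        prompt -= lr * g
```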
📝 Abstract
Thanks to large pre-trained vision-language models (VLMs) such as CLIP [37], we can craft a zero-shot classifier by discrete prompt design, e.g., the confidence score of an image being "[CLASS]" can be obtained from the VLM-provided similarity between the image and the prompt sentence "a photo of a [CLASS]". Furthermore, prompting shows great potential for fast adaptation of VLMs to downstream tasks if we fine-tune the soft prompts with a few samples. However, we find a common failure: improper fine-tuning, or tuning with extremely few samples, may even underperform the zero-shot prediction. Existing methods still address this problem with traditional anti-overfitting techniques such as early stopping and data augmentation, which lack a principled, prompting-specific solution. In this paper, we present Prompt-aligned Gradient, dubbed ProGrad, to prevent prompt tuning from forgetting the general knowledge learned from VLMs. In particular, ProGrad only updates the prompt whose gradient is aligned with (or non-conflicting with) the general knowledge, which is represented as the optimization direction offered by the pre-defined prompt predictions. Extensive experiments under the few-shot learning, domain generalization, base-to-new generalization, and cross-dataset transfer settings demonstrate the stronger few-shot generalization ability of ProGrad over state-of-the-art prompt tuning methods.
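For reference, below is a short sketch of the hand-crafted-prompt zero-shot classifier the abstract describes, using OpenAI's CLIP package. The model name, class list, and image path are placeholders chosen for illustration.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)  # placeholder backbone

class_names = ["cat", "dog", "car"]  # placeholder class list
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    # Cosine similarity between the image and each prompt sentence serves as
    # the per-class confidence score; softmax turns the scores into probabilities.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ text_feat.t()
    probs = logits.softmax(dim=-1)
```

Soft prompt tuning replaces the fixed words "a photo of a" with learnable context vectors; ProGrad constrains how those vectors are updated so the zero-shot behavior above is not forgotten.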