🤖 AI Summary
To address the degradation of prior prompt information and the reduced generalization caused by layer-wise replacement of deep continuous prompts in vision-language models, this paper proposes Modular Prompt Learning (MPL). MPL introduces, for the first time, a modular prompt architecture that is both accumulative and reusable: it explicitly preserves and reuses prompts from preceding Transformer layers, departing from the conventional single-layer overwrite paradigm. By employing parameter tying and hierarchical prompt retention, MPL optimizes only lightweight prompt parameters while keeping the backbone frozen. Evaluated on 11 benchmark datasets, MPL achieves an average 0.7% improvement in base-to-novel class generalization, with a notable 10.7% gain on EuroSAT, outperforming existing prompt learning methods. It effectively mitigates prompt decay and improves the stability of cross-dataset and zero-shot transfer.
📝 Abstract
Pre-trained vision-language models are able to interpret visual concepts and language semantics. Prompt learning, a method of constructing prompts for text or image encoders, elicits the potential of pre-trained models and readily adapts them to new scenarios. Compared to fine-tuning, prompt learning enables the model to achieve comparable or better performance with fewer trainable parameters. Moreover, prompt learning freezes the pre-trained model and avoids the catastrophic forgetting seen in fine-tuning. Continuous prompts inserted into the input of every transformer layer (i.e., deep prompts) can improve the performance of pre-trained models on downstream tasks. For the $i$-th transformer layer, the inserted prompts replace those inserted at the $(i-1)$-th layer. Although the self-attention mechanism contextualizes the newly inserted prompts with the embeddings output by the previous layer, removing all prompts inserted at the previous layer inevitably discards the information those continuous prompts contain. In this work, we propose Modular Prompt Learning (MPL), designed to promote the preservation of information contained in the inserted prompts. We evaluate the proposed method on base-to-new generalization and cross-dataset tasks. Averaged over 11 datasets, our method achieves a 0.7% performance gain on the base-to-new generalization task compared to the state-of-the-art method. The largest improvement on an individual dataset is 10.7% (EuroSAT).
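The replace-vs-retain distinction described above can be sketched in plain Python. This is a minimal, illustrative sketch only: real deep prompts are continuous embedding vectors processed by self-attention inside the encoder, whereas here tokens are stand-in string labels, and the simple prepend/drop rules and function names are assumptions, not the paper's exact operators.

```python
# Sketch: how inserted prompts flow through transformer layers under
# standard deep prompting (replace) vs. MPL-style retention (keep).
# Labels such as "p1a" stand in for continuous prompt embeddings.

def deep_prompt_forward(tokens, prompts_per_layer, retain=False):
    """Simulate the token stream across layers.

    tokens: list of input token labels (e.g. patch/word embeddings).
    prompts_per_layer: one list of prompt labels per transformer layer.
    retain=False: each layer's prompts REPLACE the previous layer's,
                  so earlier prompt information is discarded.
    retain=True:  earlier prompts are KEPT alongside new ones,
                  mimicking MPL's preservation of inserted prompts.
    """
    stream = list(tokens)
    prev_prompts = []
    for new_prompts in prompts_per_layer:
        if not retain:
            # drop the prompts that were inserted at the previous layer
            stream = [t for t in stream if t not in prev_prompts]
        stream = list(new_prompts) + stream  # prepend this layer's prompts
        prev_prompts = list(new_prompts)
        # ... a real model would run self-attention over `stream` here ...
    return stream

tokens = ["cls", "img0", "img1"]
prompts = [["p1a", "p1b"], ["p2a", "p2b"], ["p3a", "p3b"]]

replaced = deep_prompt_forward(tokens, prompts, retain=False)
retained = deep_prompt_forward(tokens, prompts, retain=True)
```

After three layers, `replaced` holds only the last layer's prompts plus the input tokens, while `retained` still carries the prompts from all three layers, which is the information the replacement scheme loses.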