🤖 AI Summary
To address the limited robustness and adaptability of vision-language models (VLMs) in few-shot learning, long-tailed classification, and out-of-distribution generalization, this paper proposes **Dropout Prompt Learning**, a token-level dynamic dropout mechanism applied to both the textual and visual branches. Each token's dropout probability is set from its importance, which is estimated jointly from intra-modal context and cross-modal alignment. The paper also introduces **residual entropy regularization**, which encourages the representation diversity that dropout brings while preserving semantic alignment for general knowledge transfer. The method introduces no additional parameters and is architecture-agnostic, so it integrates with mainstream VLMs. Evaluation across 15 benchmarks shows consistent improvements in all three scenarios; on base-to-novel generalization, the method surpasses KgCoOp by 5.10% and PromptSRC by 2.13%.
📝 Abstract
Dropout is a widely used regularization technique that improves a model's generalization ability by randomly dropping neurons. In light of this, we propose Dropout Prompt Learning, which applies dropout to improve the robustness of vision-language models. Unlike vanilla dropout, we apply dropout to the tokens of the textual and visual branches, evaluating each token's significance from both intra-modal context and inter-modal alignment, which enables a flexible dropout probability for each token. Moreover, to maintain semantic alignment for general knowledge transfer while encouraging the diverse representations that dropout introduces, we further propose residual entropy regularization. Experiments on 15 benchmarks show our method's effectiveness in challenging scenarios such as low-shot learning, long-tail classification, and out-of-distribution generalization. Notably, our method surpasses regularization-based methods on base-to-novel generalization, outperforming KgCoOp by 5.10% and PromptSRC by 2.13%.
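
The abstract does not give the scoring formula, so the following is only a minimal sketch of what importance-aware token dropout could look like, not the authors' implementation. It assumes a token's importance is the plain average of its mean cosine similarity to the other tokens in its own branch (intra-modal) and its cosine similarity to a single embedding from the other branch (inter-modal), that more important tokens get lower drop probabilities, and that dropped tokens are zeroed out. The function names, the `max_drop` cap, and the toy vectors are all hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def token_importance(tokens, other_modality_embedding):
    """Hypothetical score: average of intra-modal and inter-modal similarity."""
    scores = []
    for i, tok in enumerate(tokens):
        others = [t for j, t in enumerate(tokens) if j != i]
        # Intra-modal context: mean similarity to the other tokens of this branch.
        intra = sum(cosine(tok, t) for t in others) / len(others) if others else 0.0
        # Inter-modal alignment: similarity to an embedding from the other branch.
        inter = cosine(tok, other_modality_embedding)
        scores.append(0.5 * (intra + inter))
    return scores


def drop_probabilities(scores, max_drop=0.5):
    """Map scores to drop probabilities in [0, max_drop]; the top-scoring token gets 0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All tokens look equally important: fall back to a uniform rate (assumption).
        return [max_drop / 2] * len(scores)
    return [max_drop * (hi - s) / (hi - lo) for s in scores]


def apply_token_dropout(tokens, probs, rng):
    """Zero out each token independently with its own drop probability."""
    dropped = []
    for tok, p in zip(tokens, probs):
        keep = rng.random() >= p
        dropped.append(list(tok) if keep else [0.0] * len(tok))
    return dropped


# Toy example: three 2-d "text tokens" and one "image embedding" (made-up values).
text_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
image_embedding = [1.0, 0.0]

scores = token_importance(text_tokens, image_embedding)
probs = drop_probabilities(scores)
dropped_tokens = apply_token_dropout(text_tokens, probs, random.Random(0))
```

In this toy setup the third token is dissimilar to both its neighbours and the image embedding, so it receives the highest drop probability, while the best-aligned token is never dropped. A real implementation would operate on batched tensors inside the text and image encoders and would apply dropout only during training.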