Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of distinguishing fine-grained attributes and semantically similar subcategories in downstream image–text retrieval (ITR), this paper proposes Dual prompt Learning with Joint Category-Attribute Reweighting (DCAR), a novel prompt learning framework built upon CLIP. DCAR introduces a dynamic dual-path prompting mechanism—semantic and visual—that jointly refines cross-modal representations. It further incorporates a category-attribute joint reweighting module to adaptively enhance discriminative attribute descriptions, and integrates multi-view negative sampling with a category-matching weighted loss to strengthen fine-grained cross-modal alignment. Evaluated on the newly constructed fine-grained dataset FDRD, DCAR achieves state-of-the-art performance, significantly outperforming existing methods. Ablation studies confirm the efficacy of each component in improving retrieval accuracy and fine-grained discrimination.

📝 Abstract
Recently, prompt learning has demonstrated remarkable success in adapting pre-trained Vision-Language Models (VLMs) to various downstream tasks such as image classification. However, its application to the downstream Image-Text Retrieval (ITR) task is more challenging. We find that the challenge lies in discriminating both fine-grained attributes and similar subcategories of the downstream data. To address this challenge, we propose Dual prompt Learning with Joint Category-Attribute Reweighting (DCAR), a novel dual-prompt learning framework to achieve precise image-text matching. The framework dynamically adjusts prompt vectors from both semantic and visual dimensions to improve the performance of CLIP on the downstream ITR task. Based on the prompt paradigm, DCAR jointly optimizes attribute and class features to enhance fine-grained representation learning. Specifically, (1) at the attribute level, it dynamically updates the weights of attribute descriptions based on text-image mutual information correlation; (2) at the category level, it introduces negative samples from multiple perspectives with category-matching weighting to learn subcategory distinctions. To validate our method, we construct the Fine-class Described Retrieval Dataset (FDRD), which serves as a challenging benchmark for ITR in downstream data domains. It covers over 1,500 downstream fine categories and 230,000 image-caption pairs with detailed attribute annotations. Extensive experiments on FDRD demonstrate that DCAR achieves state-of-the-art performance over existing baselines.
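The abstract's attribute-level mechanism, which dynamically updates the weights of attribute descriptions based on text-image correlation, can be illustrated with a minimal sketch. The paper measures correlation via text-image mutual information; cosine similarity over embeddings is used here as a stand-in assumption, and all function and variable names are illustrative, not from the paper.

```python
import numpy as np

def reweight_attributes(image_emb, attr_embs, temperature=0.07):
    """Attribute-level reweighting sketch: score each attribute
    description against the image and softmax the scores into
    adaptive weights. Cosine similarity stands in for the paper's
    text-image mutual-information correlation (an assumption)."""
    # L2-normalize so dot products become cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    attrs = attr_embs / np.linalg.norm(attr_embs, axis=1, keepdims=True)
    sims = attrs @ img                      # one score per attribute description
    w = np.exp(sims / temperature)
    w /= w.sum()                            # adaptive attribute weights
    # Fuse: weighted combination of attribute embeddings as the refined text view
    return w, (w[:, None] * attrs).sum(axis=0)

rng = np.random.default_rng(0)
weights, fused = reweight_attributes(rng.normal(size=512),
                                     rng.normal(size=(5, 512)))
print(weights.shape, fused.shape)
```

Attributes whose descriptions correlate strongly with the image dominate the fused representation, which matches the stated goal of emphasizing discriminative attribute descriptions.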
Problem

Research questions and friction points this paper is trying to address.

Adapting vision-language models to image-text retrieval tasks
Discriminating fine-grained attributes and similar subcategories
Enhancing fine-grained representation learning via dual-prompt framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual prompt learning for image-text matching
Joint category-attribute reweighting framework
Dynamic prompt vector adjustment
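The category-level idea of multi-view negative sampling with category-matching weighting can be sketched as a weighted InfoNCE-style loss, where negatives from similar subcategories are up-weighted so the model learns subcategory distinctions. The weighting form below is an assumption for illustration, not the paper's exact loss.

```python
import numpy as np

def weighted_retrieval_loss(sim_pos, sim_negs, cat_match, tau=0.07):
    """Category-matching weighted contrastive loss sketch.
    cat_match in [0, 1] scores how close each negative's category is
    to the query's; closer (harder) negatives get a larger weight.
    The 1 + cat_match weighting is an illustrative assumption."""
    w = 1.0 + cat_match                       # similar-subcategory negatives count more
    neg_term = (np.exp(sim_negs / tau) * w).sum()
    denom = np.exp(sim_pos / tau) + neg_term
    return -np.log(np.exp(sim_pos / tau) / denom)

# Same similarities, increasingly category-matched negatives -> larger loss
easy = weighted_retrieval_loss(0.8, np.array([0.5, 0.4, 0.7]), np.zeros(3))
hard = weighted_retrieval_loss(0.8, np.array([0.5, 0.4, 0.7]), np.ones(3))
print(easy < hard)
```

Up-weighting negatives from near-identical subcategories forces the margin to grow exactly where fine-grained confusion occurs, which is the friction point the Problem section identifies.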
Yifan Wang
College of Computer Science, Sichuan University, Chengdu, China
Tao Wang
College of Computer Science, Sichuan University, Chengdu, China
Chenwei Tang
Sichuan University
neural network, zero-shot learning, deep learning
Caiyang Yu
Student, Sichuan University
neural architecture search, swarm intelligence optimization
Zhengqing Zang
College of Computer Science, Sichuan University, Chengdu, China
Mengmi Zhang
Assistant Professor and PI of Deep NeuroCognition Lab, Nanyang Technological University, Singapore
neuroscience-inspired AI, computer vision, computational neuroscience, cognitive science
Shudong Huang
College of Computer Science, Sichuan University, Chengdu, China
Jiancheng Lv
University of Science and Technology of China
Operations Management, Marketing