Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning

📅 2024-12-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges of large intra-class variation and scarce labeled data in fine-grained few-shot image classification—which degrade model discriminability—this paper proposes a vision-guided adaptive prompt tuning method. Our approach integrates CLIP and ViT architectures to jointly optimize visual and textual representations. Key contributions include: (1) a novel vision-driven dynamic text prompt generation mechanism, where cross-attention dynamically aligns image patches with textual prompts; and (2) the first coupling of cross-attention with Monte Carlo Dropout to simultaneously enhance classification accuracy and enable well-calibrated uncertainty estimation. Evaluated on three fine-grained benchmarks—CUBirds, Oxford Flowers, and FGVC Aircraft—the method significantly outperforms mainstream baselines including CoOp and VPT, achieving an average accuracy improvement of 5.2%. It thus delivers both strong discriminative capability and reliable, uncertainty-aware predictions.

Technology Category

Application Category

📝 Abstract
Few-shot, fine-grained classification in computer vision poses significant challenges due to the need to differentiate subtle class distinctions with limited data. This paper presents a novel method that enhances the Contrastive Language-Image Pre-Training (CLIP) model through adaptive prompt tuning, guided by real-time visual inputs. Unlike existing techniques such as Context Optimization (CoOp) and Visual Prompt Tuning (VPT), which are constrained by static prompts or visual token reliance, the proposed approach leverages a cross-attention mechanism to dynamically refine text prompts for the image at hand. This enables an image-specific alignment of textual features with image patches extracted from the Vision Transformer, making the model more effective for datasets with high intra-class variance and low inter-class differences. The method is evaluated on several datasets, including CUBirds, Oxford Flowers, and FGVC Aircraft, showing significant performance gains over static prompt tuning approaches. To ensure these performance gains translate into trustworthy predictions, we integrate Monte-Carlo Dropout in our approach to improve the reliability of the model predictions and uncertainty estimates. This integration provides valuable insights into the model's predictive confidence, helping to identify when predictions can be trusted and when additional verification is necessary. This dynamic approach offers a robust solution, advancing the state-of-the-art for few-shot fine-grained classification.
Problem

Research questions and friction points this paper is trying to address.

Computer Vision
Image Classification
Limited Training Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Prompt Tuning
Dynamic Cross-Attention
Monte Carlo Dropout
🔎 Similar Papers
No similar papers found.
E
Eric Brouwer
Faculty of Science and Engineering, University of Groningen, Nijenborgh 9, 9747 AG, Groningen, the Netherlands
J
Jan Erik van Woerden
TNO, Oude Waalsdorperweg 63, 2597 AK, Den Haag, the Netherlands
G
G. Burghouts
TNO, Oude Waalsdorperweg 63, 2597 AK, Den Haag, the Netherlands
M
Matias Valedenegro-Toro
Faculty of Science and Engineering, University of Groningen, Nijenborgh 9, 9747 AG, Groningen, the Netherlands
Marco Zullich
Marco Zullich
University of Groningen
Deep learning