Fine-grained Image Retrieval via Dual-Vision Adaptation

📅 2025-06-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Fine-grained image retrieval (FGIR) faces challenges in learning discriminative representations and achieving strong generalization. To address these, this paper proposes a Dual-Vision Adaptation framework that operates with a frozen pre-trained Vision Transformer (ViT) backbone. It jointly enables sample-level object-perceptual adaptation—capturing object-centric visual cues—and feature-level in-context adaptation—refining representations via contextual priors—while incorporating discrimination-perception knowledge distillation to enhance encoder discriminability. The method introduces only lightweight, learnable perturbation modules and context adapters, drastically reducing parameter count. Evaluated on three in-distribution and three out-of-distribution fine-grained benchmarks, it achieves state-of-the-art performance across all settings, demonstrating both superior generalization and high retrieval efficiency.

Technology Category

Application Category

📝 Abstract

Fine-Grained Image Retrieval~(FGIR) faces challenges in learning discriminative visual representations to retrieve images with similar fine-grained features. Current leading FGIR solutions typically follow two regimes: enforce pairwise similarity constraints in the semantic embedding space, or incorporate a localization sub-network to fine-tune the entire model. However, such two regimes tend to overfit the training data while forgetting the knowledge gained from large-scale pre-training, thus reducing their generalization ability. In this paper, we propose a Dual-Vision Adaptation (DVA) approach for FGIR, which guides the frozen pre-trained model to perform FGIR through collaborative sample and feature adaptation. Specifically, we design Object-Perceptual Adaptation, which modifies input samples to help the pre-trained model perceive critical objects and elements within objects that are helpful for category prediction. Meanwhile, we propose In-Context Adaptation, which introduces a small set of parameters for feature adaptation without modifying the pre-trained parameters. This makes the FGIR task using these adjusted features closer to the task solved during the pre-training. Additionally, to balance retrieval efficiency and performance, we propose Discrimination Perception Transfer to transfer the discriminative knowledge in the object-perceptual adaptation to the image encoder using the knowledge distillation mechanism. Extensive experiments show that DVA has fewer learnable parameters and performs well on three in-distribution and three out-of-distribution fine-grained datasets.

Problem

Research questions and friction points this paper is trying to address.

Improving discriminative visual representations for fine-grained image retrieval

Addressing overfitting in FGIR by leveraging pre-trained knowledge

Balancing retrieval efficiency and performance with knowledge distillation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-Perceptual Adaptation modifies input samples

In-Context Adaptation introduces small feature parameters

Discrimination Perception Transfer uses knowledge distillation

🔎 Similar Papers

No similar papers found.

Authors to Follow