Fine-grained Image Retrieval via Dual-Vision Adaptation

📅 2025-06-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Fine-grained image retrieval (FGIR) faces challenges in learning discriminative representations and achieving strong generalization. To address these, this paper proposes a Dual-Vision Adaptation framework that operates with a frozen pre-trained Vision Transformer (ViT) backbone. It jointly enables sample-level object-perceptual adaptation—capturing object-centric visual cues—and feature-level in-context adaptation—refining representations via contextual priors—while incorporating discrimination-perception knowledge distillation to enhance encoder discriminability. The method introduces only lightweight, learnable perturbation modules and context adapters, drastically reducing parameter count. Evaluated on three in-distribution and three out-of-distribution fine-grained benchmarks, it achieves state-of-the-art performance across all settings, demonstrating both superior generalization and high retrieval efficiency.

Technology Category

Application Category

📝 Abstract
Fine-Grained Image Retrieval~(FGIR) faces challenges in learning discriminative visual representations to retrieve images with similar fine-grained features. Current leading FGIR solutions typically follow two regimes: enforce pairwise similarity constraints in the semantic embedding space, or incorporate a localization sub-network to fine-tune the entire model. However, such two regimes tend to overfit the training data while forgetting the knowledge gained from large-scale pre-training, thus reducing their generalization ability. In this paper, we propose a Dual-Vision Adaptation (DVA) approach for FGIR, which guides the frozen pre-trained model to perform FGIR through collaborative sample and feature adaptation. Specifically, we design Object-Perceptual Adaptation, which modifies input samples to help the pre-trained model perceive critical objects and elements within objects that are helpful for category prediction. Meanwhile, we propose In-Context Adaptation, which introduces a small set of parameters for feature adaptation without modifying the pre-trained parameters. This makes the FGIR task using these adjusted features closer to the task solved during the pre-training. Additionally, to balance retrieval efficiency and performance, we propose Discrimination Perception Transfer to transfer the discriminative knowledge in the object-perceptual adaptation to the image encoder using the knowledge distillation mechanism. Extensive experiments show that DVA has fewer learnable parameters and performs well on three in-distribution and three out-of-distribution fine-grained datasets.
Problem

Research questions and friction points this paper is trying to address.

Improving discriminative visual representations for fine-grained image retrieval
Addressing overfitting in FGIR by leveraging pre-trained knowledge
Balancing retrieval efficiency and performance with knowledge distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-Perceptual Adaptation modifies input samples
In-Context Adaptation introduces small feature parameters
Discrimination Perception Transfer uses knowledge distillation
🔎 Similar Papers
No similar papers found.
X
Xin Jiang
Nanjing University of Science and Technology
Meiqi Cao
Meiqi Cao
Nanjing University Of Science And Technology
H
Hao Tang
Centre for Smart Health, Hong Kong Polytechnic University
Fei Shen
Fei Shen
National University of Singapore
Controllable GenerationMultimodal Safety
Z
Zechao Li
Nanjing University of Science and Technology