Cross-Level Multi-Instance Distillation for Self-Supervised Fine-Grained Visual Categorization

📅 2024-01-16
🏛️ IEEE Transactions on Image Processing
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the high cost of expert annotation and the weak discriminability of self-supervised representations in fine-grained visual categorization (FGVC), this paper proposes a Cross-level Multi-instance Distillation (CMD) framework. CMD introduces, for the first time, collaborative intra-level and inter-level multi-instance knowledge distillation that explicitly models the contribution of discriminative image patches to fine-grained semantics, thereby overcoming the misalignment between class-agnostic pre-trained features and fine-grained discriminative cues. The framework integrates multiple instance learning, region-to-image crop alignment, self-supervised contrastive learning, and cross-level feature relationship modeling. On CUB-200-2011, Stanford Cars, and FGVC Aircraft, CMD surpasses state-of-the-art self-supervised methods by up to 19.78% in both top-1 classification accuracy and Rank-1 retrieval, significantly improving the quality of fine-grained representations.

๐Ÿ“ Abstract
High-quality annotation of fine-grained visual categories (e.g., species, brands) demands great expert knowledge, which is taxing and time-consuming. Alternatively, learning fine-grained visual representations from enormous unlabeled images by self-supervised learning becomes a feasible solution. However, recent investigations find that existing self-supervised learning methods are less qualified to represent fine-grained categories. The bottleneck lies in that the pre-trained class-agnostic representation is built from every patch-wise embedding, while fine-grained categories are determined by only a few key patches of an image. In this paper, we propose a Cross-level Multi-instance Distillation (CMD) framework to tackle this challenge. Our key idea is to consider the importance of each image patch in determining the fine-grained representation by multiple instance learning. To comprehensively learn the relation between informative patches and fine-grained semantics, multi-instance knowledge distillation is implemented both on the region/image crop pairs between the teacher and student nets, and on the region-image crops inside the teacher/student net, which we term intra-level multi-instance distillation and inter-level multi-instance distillation, respectively. Extensive experiments on several commonly used datasets, including CUB-200-2011, Stanford Cars and FGVC Aircraft, demonstrate that the proposed method outperforms contemporary methods by up to 10.14% and existing state-of-the-art self-supervised learning approaches by up to 19.78% on both the top-1 accuracy and Rank-1 retrieval metrics. Source code is available at https://github.com/BiQiWHU/CMD
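The abstract's central idea, weighting patch embeddings by their importance via multiple instance learning rather than averaging all patches, can be sketched in a few lines. This is a minimal illustrative sketch, assuming a single learned scoring vector `w` and softmax-normalized attention; CMD's actual weighting scheme may differ.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def mil_pool(patch_embeddings, w):
    """Pool N patch embeddings (N, D) into one image representation (D,).

    Each patch is scored against a learned vector w, scores are
    softmax-normalized into importance weights, and the image
    representation is the weighted sum of patch embeddings.
    """
    scores = patch_embeddings @ w            # (N,) per-patch relevance
    alpha = softmax(scores)                  # (N,) importance weights
    return alpha @ patch_embeddings, alpha   # (D,) pooled repr, weights

# toy example: a 7x7 grid of 16-d patch embeddings (49 patches total)
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 16))
w = rng.normal(size=16)
image_repr, alpha = mil_pool(patches, w)
```

The weights `alpha` make the contribution of each patch explicit, which is what lets a few discriminative patches dominate the representation instead of being averaged away.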
Problem

Research questions and friction points this paper is trying to address.

Self-supervised learning struggles with fine-grained visual categorization
Existing methods fail to focus on key image patches
Need to model patch importance in representation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-level multi-instance distillation framework
Multiple instance learning for patch importance
Intra- and inter-level knowledge distillation
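The intra- and inter-level distillation described above can be sketched as two complementary alignment terms: intra-level matches student crops to teacher crops at the same level, while inter-level aligns region and image representations within each net. The cosine-similarity losses below are an illustrative assumption for exposition, not the paper's exact objective.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity of two 1-D vectors, guarded against zero norms
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def distill_loss(student_region, student_image, teacher_region, teacher_image):
    # intra-level: student matches the teacher at the same crop level
    intra = (1 - cosine(student_region, teacher_region)) \
          + (1 - cosine(student_image, teacher_image))
    # inter-level: region and image representations agree inside each net
    inter = (1 - cosine(student_region, student_image)) \
          + (1 - cosine(teacher_region, teacher_image))
    return intra + inter

# identical representations everywhere -> loss near zero
v = np.ones(8)
zero_loss = distill_loss(v, v, v, v)
```

Combining the two terms is what makes the distillation "cross-level": the student learns both from the teacher and from the relation between its own region and image views.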