Modelling Multi-modal Cross-interaction for Multi-label Few-shot Image Classification Based on Local Feature Selection

📅 2024-12-18

🏛️ ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)

📈 Citations: 0

✨ Influential: 0

career value

177K/year

🤖 AI Summary

Multi-label few-shot image classification (ML-FSIC) faces the core challenge of achieving fine-grained alignment between multi-semantic labels and local image regions under extreme label scarcity. To address this, we propose a label-semantics-driven progressive prototype learning framework: (1) initializing and refining label prototypes via word embeddings to encode semantic priors; (2) introducing a loss-change measurement (LCM)-based interpretable local feature selection strategy to enhance robustness against noise in region–label matching; and (3) incorporating a multimodal cross-attention module to model deep interactions between visual local features and semantic label priors. Evaluated on four benchmarks—COCO, PASCAL VOC, NUS-WIDE, and iMaterialist—our method consistently outperforms existing state-of-the-art approaches, achieving average mAP gains of 3.2–5.8 percentage points. To our knowledge, it is the first method to enable semantic-aware local alignment modeling under few-shot conditions.

Technology Category

Application Category

📝 Abstract

The aim of multi-label few-shot image classification (ML-FSIC) is to assign semantic labels to images, in settings where only a small number of training examples are available for each label. A key feature of the multi-label setting is that an image often has several labels, which typically refer to objects appearing in different regions of the image. When estimating label prototypes, in a metric-based setting, it is thus important to determine which regions are relevant for which labels, but the limited amount of training data and the noisy nature of local features make this highly challenging. As a solution, we propose a strategy in which label prototypes are gradually refined. First, we initialize the prototypes using word embeddings, which allows us to leverage prior knowledge about the meaning of the labels. Second, taking advantage of these initial prototypes, we then use a Loss Change Measurement (LCM) strategy to select the local features from the training images (i.e. the support set) that are most likely to be representative of a given label. Third, we construct the final prototype of the label by aggregating these representative local features using a multi-modal cross-interaction mechanism, which again relies on the initial word embedding-based prototypes. Experiments on COCO, PASCAL VOC, NUS-WIDE, and iMaterialist show that our model substantially improves the current state-of-the-art.

Problem

Research questions and friction points this paper is trying to address.

Multi-label few-shot image classification

Local feature selection challenge

Refining label prototypes using word embeddings

Innovation

Methods, ideas, or system contributions that make the work stand out.

Refine label prototypes gradually

Use word embeddings for initialization

Apply multi-modal cross-interaction mechanism

🔎 Similar Papers

Cross-domain Multi-modal Few-shot Object Detection via Rich Text