Crossmodal Knowledge Distillation with WordNet-Relaxed Text Embeddings for Robust Image Classification

📅 2025-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
In supervised image classification, feeding class labels directly into a text teacher can cause label leakage, undermining crossmodal knowledge distillation (KD). To address this, the paper proposes a crossmodal KD framework for image classification that uses CLIP image embeddings as teacher guidance, bypassing direct use of exact class names, and introduces semantically relaxed text embeddings grounded in the WordNet hierarchy to generate diverse, interpretable textual priors. A hierarchical distillation loss coupled with cross-modal feature alignment further mitigates textual shortcut effects. Evaluated on six benchmark datasets, the method achieves state-of-the-art or second-best performance, improving student accuracy, generalization robustness, and reliance on visual features. Empirical results confirm that WordNet-relaxed embeddings prevent label leakage while enhancing interpretability.

📝 Abstract
Crossmodal knowledge distillation (KD) aims to enhance a unimodal student using a multimodal teacher model. In particular, when the teacher's modalities include the student's, additional complementary information can be exploited to improve knowledge transfer. In supervised image classification, image datasets typically include class labels that represent high-level concepts, suggesting a natural avenue to incorporate textual cues for crossmodal KD. However, these labels rarely capture the deeper semantic structures in real-world visuals and can lead to label leakage if used directly as inputs, ultimately limiting KD performance. To address these issues, we propose a multi-teacher crossmodal KD framework that integrates CLIP image embeddings with learnable WordNet-relaxed text embeddings under a hierarchical loss. By avoiding direct use of exact class names and instead using semantically richer WordNet expansions, we mitigate label leakage and introduce more diverse textual cues. Experiments show that this strategy significantly boosts student performance, whereas noisy or overly precise text embeddings hinder distillation efficiency. Interpretability analyses confirm that WordNet-relaxed prompts encourage heavier reliance on visual features over textual shortcuts, while still effectively incorporating the newly introduced textual cues. Our method achieves state-of-the-art or second-best results on six public datasets, demonstrating its effectiveness in advancing crossmodal KD.
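The core idea of the WordNet relaxation is to replace the exact class name with broader ancestors from the WordNet hierarchy before building text prompts, so the text teacher cannot simply leak the label. A minimal sketch of that idea follows; the hard-coded hypernym table and the `relaxed_prompts` helper are illustrative stand-ins (the paper queries the actual WordNet taxonomy, and its prompt templates are not specified here).

```python
# Illustrative sketch of WordNet-relaxed prompt generation: use only
# hypernyms (broader concepts) of a class, never the exact class name,
# so textual cues stay informative without leaking the label.
# TOY_HYPERNYMS stands in for real WordNet hypernym chains.
TOY_HYPERNYMS = {
    "sparrow": ["passerine", "bird", "vertebrate"],
    "oak": ["tree", "woody plant", "plant"],
}

def relaxed_prompts(class_name, depth=2, template="a photo of a {}"):
    """Build text prompts from the first `depth` hypernyms of a class,
    deliberately skipping the exact class name itself."""
    hypernyms = TOY_HYPERNYMS.get(class_name, [])[:depth]
    return [template.format(h) for h in hypernyms]

prompts = relaxed_prompts("sparrow")
# Prompts mention "passerine" and "bird", but never "sparrow".
```

Varying `depth` trades off specificity against leakage: deeper (broader) hypernyms give more diverse but vaguer cues, which matches the paper's finding that overly precise text embeddings hinder distillation.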
Problem

Research questions and friction points this paper is trying to address.

Enhancing unimodal student models using multimodal teacher knowledge
Mitigating label leakage in crossmodal knowledge distillation
Improving textual cues with WordNet-relaxed embeddings for robust classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-teacher crossmodal KD framework
WordNet-relaxed text embeddings
Hierarchical loss integration
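To make the multi-teacher setup concrete, the following numpy sketch shows one plausible shape of a distillation objective that aligns a student feature with both a CLIP image-teacher embedding and a relaxed text embedding. The cosine-alignment form and the `alpha` weighting are assumptions for illustration; the paper's actual hierarchical loss is not reproduced here.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def crossmodal_kd_loss(student_feat, clip_image_feat, text_feat, alpha=0.5):
    """Toy two-teacher alignment loss: a weighted sum of (1 - cosine)
    terms pulling the student toward the CLIP image embedding and the
    WordNet-relaxed text embedding. Zero when the student matches both."""
    img_term = 1.0 - cosine(student_feat, clip_image_feat)
    txt_term = 1.0 - cosine(student_feat, text_feat)
    return alpha * img_term + (1.0 - alpha) * txt_term
```

Weighting the image-teacher term more heavily (larger `alpha`) is one simple way to encourage reliance on visual features over textual shortcuts, in the spirit of the paper's interpretability findings.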
👥 Authors
Chenqi Guo (North China Electric Power University)
Mengshuo Rong (North China Electric Power University)
Qianli Feng (Amazon)
Rongfan Feng (North China Electric Power University)
Yinglong Ma (North China Electric Power University)