Enriching Knowledge Distillation with Cross-Modal Teacher Fusion

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multi-teacher knowledge distillation methods rely solely on unimodal (visual) supervision, suffering from limited semantic diversity and poor cross-modal generalization. To address this, we propose a CLIP-enhanced multi-teacher distillation framework that leverages CLIP's vision–language alignment as a complementary source of supervision. Our method fuses knowledge from conventional vision teachers and CLIP at both the logit and feature levels, and introduces a multi-prompt textual guidance mechanism to enrich semantic supervision and improve class-wise consistency. To the best of our knowledge, this is the first work to enable collaborative cross-modal teacher modeling in knowledge distillation. Extensive experiments demonstrate that our approach significantly improves student model accuracy and robustness across multiple benchmarks, particularly under distribution shifts and input perturbations, outperforming state-of-the-art methods.

📝 Abstract
Multi-teacher knowledge distillation (KD), a more effective technique than traditional single-teacher methods, transfers knowledge from expert teachers to a compact student model using logit or feature matching. However, most existing approaches lack knowledge diversity, as they rely solely on unimodal visual information, overlooking the potential of cross-modal representations. In this work, we explore the use of CLIP's vision-language knowledge as a complementary source of supervision for KD, an area that remains largely underexplored. We propose a simple yet effective framework that fuses the logits and features of a conventional teacher with those from CLIP. By incorporating CLIP's multi-prompt textual guidance, the fused supervision captures both dataset-specific and semantically enriched visual cues. Beyond accuracy, analysis shows that the fused teacher yields more confident and reliable predictions, significantly increasing confident-correct cases while reducing confidently wrong ones. Moreover, fusion with CLIP refines the entire logit distribution, producing semantically meaningful probabilities for non-target classes, thereby improving inter-class consistency and distillation quality. Despite its simplicity, the proposed method, Enriching Knowledge Distillation (RichKD), consistently outperforms most existing baselines across multiple benchmarks and exhibits stronger robustness under distribution shifts and input corruptions.
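The core mechanism the abstract describes, fusing a conventional teacher's logits with CLIP's and distilling the fused distribution into the student, can be sketched as below. This is a minimal illustration, not the paper's implementation: the fusion rule (a convex combination with weight `alpha`), the temperature `T`, and the function names are all assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_logits(teacher_logits, clip_logits, alpha=0.5):
    """Convex combination of conventional-teacher and CLIP logits.

    `alpha` balances dataset-specific (teacher) and semantically enriched
    (CLIP) supervision; RichKD's actual fusion rule may differ.
    """
    return alpha * teacher_logits + (1.0 - alpha) * clip_logits

def kd_loss(student_logits, fused_logits, T=4.0):
    """Standard KD objective: KL divergence between temperature-softened
    fused-teacher and student distributions, scaled by T^2."""
    p = softmax(fused_logits, T)                       # fused-teacher targets
    log_q = np.log(softmax(student_logits, T) + 1e-12)
    log_p = np.log(p + 1e-12)
    return float((p * (log_p - log_q)).sum(axis=-1).mean() * T * T)
```

Because the fused distribution also reshapes the probabilities of non-target classes, the student receives richer inter-class structure than either teacher would provide alone, which is the "refined logit distribution" effect the abstract highlights.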
Problem

Research questions and friction points this paper is trying to address.

Fusing cross-modal CLIP knowledge with visual teachers for distillation
Enhancing knowledge diversity beyond unimodal visual representations
Improving prediction confidence and semantic consistency in distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses conventional-teacher logits with CLIP logits
Integrates multi-prompt textual guidance features
Enhances logit distribution for semantic consistency
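The multi-prompt textual guidance listed above follows CLIP's standard prompt-ensembling recipe: encode each class name under several prompt templates, average the normalized text embeddings into one prototype per class, and score images by scaled cosine similarity. The sketch below assumes a stand-in `encode_text` function and a hypothetical prompt set; RichKD's actual templates and fusion of these features are not specified here.

```python
import numpy as np

# Hypothetical prompt templates; the paper's actual set is not given here.
PROMPTS = ["a photo of a {}.", "a blurry photo of a {}.", "art of a {}."]

def l2_normalize(x, axis=-1):
    """Project vectors onto the unit sphere (CLIP compares unit vectors)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-12)

def class_text_embeddings(class_names, encode_text):
    """One prototype per class: average the text embeddings over all
    prompt templates, normalizing before and after the mean."""
    protos = []
    for name in class_names:
        embs = np.stack([encode_text(p.format(name)) for p in PROMPTS])
        protos.append(l2_normalize(l2_normalize(embs).mean(axis=0)))
    return np.stack(protos)  # shape: (num_classes, dim)

def clip_logits(image_features, text_protos, scale=100.0):
    """Scaled cosine similarity between image features and class
    prototypes, mimicking CLIP's learned logit scale."""
    return scale * l2_normalize(image_features) @ text_protos.T
```

These logits are what a cross-modal teacher contributes to the fusion: class scores grounded in language, which carry semantic relations between classes that a purely visual teacher trained on one dataset may miss.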
Amir M. Mansourian
Master's student, Sharif University of Technology
Computer Vision, Machine Learning
Amir Mohammad Babaei
Image Processing Lab, Sharif University of Technology, Tehran, Iran
Shohre Kasaei
Image Processing Lab, Sharif University of Technology, Tehran, Iran