Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

📅 2026-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the reliance of cross-modal knowledge distillation on costly paired data by proposing the first unpaired cross-modal distillation framework that operates without sample-level alignment. The method transfers knowledge by jointly aligning the feature distributions and the predicted-label distributions of the teacher and student models, which removes the need for semantic correspondence between individual samples. Theoretical analysis shows that distribution alignment is central to the efficacy of cross-modal distillation and yields a general framework with formal guarantees. Extensive experiments show that the proposed approach significantly outperforms existing methods across multiple multimodal benchmarks, in both paired and unpaired data settings.
📝 Abstract
Cross-modal knowledge distillation (CMKD) studies how a (large) teacher model trained on one type of data (e.g., images) can guide a (smaller) student model built on another type of data (e.g., text/audio). Existing CMKD methods often require paired multi-modal data with aligned semantics, but obtaining such paired data is often costly and impractical. To mitigate this limitation, we develop a new CMKD framework for the more challenging setting where paired data are unavailable. In particular, we establish a cross-modal distributional relationship between teacher and student models, which reveals two fundamental quantities governing effective distillation: feature alignment and label alignment. These quantities characterize the semantic discrepancy between modalities at the levels of representation and prediction distributions, respectively. Motivated by this insight, we propose a principled framework, with theoretical guarantees, that enables effective cross-modal knowledge distillation by aligning distributions rather than individual samples. Extensive experiments across a wide range of multimodal benchmarks show that our framework is highly effective in both unpaired and paired data settings, improving significantly over prior work.
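To make the idea of aligning distributions rather than individual samples concrete, here is a minimal sketch of an unpaired alignment loss. It is not the paper's algorithm: the abstract does not specify the divergences used, so this sketch assumes a squared maximum mean discrepancy (MMD) with an RBF kernel for the feature term and a KL divergence between batch-averaged predictions for the label term. It also assumes teacher and student features have already been projected into a shared space of equal dimension. All function names and the `label_weight` parameter are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, y: Vector, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs: List[Vector], ys: List[Vector], bandwidth: float = 1.0) -> float:
    """Biased estimate of squared MMD between two unpaired samples.

    The two samples may have different sizes; no row of xs is matched
    to any particular row of ys.
    """
    def mean_kernel(a: List[Vector], b: List[Vector]) -> float:
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def mean_prediction(probs: List[Vector]) -> List[float]:
    """Average class-probability vector over a batch (marginal label distribution)."""
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]


def kl_divergence(p: Vector, q: Vector, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions, smoothed by eps."""
    return sum(pk * math.log((pk + eps) / (qk + eps)) for pk, qk in zip(p, q))


def unpaired_alignment_loss(
    teacher_feats: List[Vector],
    student_feats: List[Vector],
    teacher_probs: List[Vector],
    student_probs: List[Vector],
    label_weight: float = 1.0,  # hypothetical trade-off weight
) -> float:
    """Feature-distribution term plus label-distribution term (illustrative only)."""
    feature_term = mmd_squared(teacher_feats, student_feats)
    label_term = kl_divergence(
        mean_prediction(teacher_probs), mean_prediction(student_probs)
    )
    return feature_term + label_weight * label_term
```

Because both terms compare batch-level statistics, the teacher batch (e.g., images) and the student batch (e.g., audio clips) can be drawn independently and can differ in size. In practice such a loss would be written in an autodiff framework so the student can be trained by gradient descent; plain Python is used here only to keep the sketch self-contained.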
Problem

Research questions and friction points this paper is trying to address.

Cross-Modal Knowledge Distillation
Unpaired Data
Multi-modal Learning
Knowledge Transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal knowledge distillation
unpaired data
distribution alignment
feature alignment
label alignment
Trong Khiem Tran
School of Electrical Engineering and Computer Science, Washington State University, Pullman, US
Anh Duc Chu
School of Information and Communications Technology, Hanoi University of Science and Technology, Hanoi, Vietnam
Quang Hung Pham
School of Information and Communications Technology, Hanoi University of Science and Technology, Hanoi, Vietnam
Phi Le Nguyen
School of Information and Communications Technology, Hanoi University of Science and Technology, Hanoi, Vietnam
Trong Nghia Hoang
Assistant Professor, Washington State University
Machine Learning, Federated Learning, Meta Learning, Model Fusion, Gaussian Processes