🤖 AI Summary
This work addresses the reliance of cross-modal knowledge distillation on costly paired data by proposing a cross-modal distillation framework that operates on unpaired data, without sample-level alignment. The method transfers knowledge by jointly aligning the feature distributions and the predicted label distributions of the teacher and student models, thereby eliminating the need for semantic correspondence between individual samples. Theoretical analysis identifies distribution alignment as central to the efficacy of cross-modal distillation and yields a general framework with formal guarantees. Extensive experiments on multiple multimodal benchmarks show that the proposed approach significantly outperforms existing methods in both paired and unpaired data settings.
📝 Abstract
Cross-modal knowledge distillation (CMKD) studies how a (large) teacher model trained on one type of data (e.g., images) can guide a (smaller) student model built on another type of data (e.g., text/audio). Existing CMKD methods often require paired multi-modal data with aligned semantics, but obtaining such paired data is often costly and impractical. To mitigate this limitation, we develop a new CMKD framework for the more challenging setting where paired data are unavailable. In particular, we establish a cross-modal distributional relationship between teacher and student models, which reveals two fundamental quantities governing effective distillation: feature alignment and label alignment. These quantities characterize the semantic discrepancy between modalities at the levels of representation and prediction distributions, respectively. Motivated by this insight, we propose a principled framework, with theoretical guarantees, that enables effective cross-modal knowledge distillation by aligning distributions rather than individual samples. Extensive experiments across a wide range of multimodal benchmarks show that our framework is highly effective in both unpaired and paired data settings, improving significantly over prior work.
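To make the idea of aligning distributions rather than individual samples concrete, here is a minimal NumPy sketch. It is not the paper's objective, which the abstract does not specify. As stand-ins, it assumes a squared maximum mean discrepancy (MMD) with an RBF kernel for feature alignment and a KL divergence between batch-averaged predictions for label alignment. The function names, the `bandwidth` and `lam` parameters, and the assumption that teacher and student features have already been projected to a shared dimension are all hypothetical.

```python
import numpy as np


def rbf_mmd2(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sample sets (RBF kernel).

    x and y may contain different numbers of rows; no row of x is matched
    to any row of y, so the batches can be unpaired.
    """
    def kernel(a, b):
        sq_dist = (
            np.sum(a**2, axis=1)[:, None]
            + np.sum(b**2, axis=1)[None, :]
            - 2.0 * a @ b.T
        )
        # Clip tiny negative values caused by floating-point round-off.
        return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def label_marginal_kl(teacher_logits, student_logits, eps=1e-8):
    """KL(teacher || student) between batch-averaged class distributions."""
    p = softmax(teacher_logits).mean(axis=0)
    q = softmax(student_logits).mean(axis=0)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def unpaired_alignment_loss(t_feat, s_feat, t_logits, s_logits, lam=1.0):
    """Illustrative loss: feature-distribution term plus weighted label term."""
    feature_term = float(rbf_mmd2(t_feat, s_feat))
    label_term = label_marginal_kl(t_logits, s_logits)
    return feature_term + lam * label_term


# Usage: teacher and student batches come from different, unpaired samples.
rng = np.random.default_rng(0)
teacher_features = rng.normal(size=(64, 8))   # e.g. image-side features
student_features = rng.normal(size=(48, 8))   # e.g. audio-side features
teacher_logits = rng.normal(size=(64, 5))
student_logits = rng.normal(size=(48, 5))
loss = unpaired_alignment_loss(
    teacher_features, student_features, teacher_logits, student_logits
)
```

Because both terms depend only on batch-level statistics, the teacher and student batches can differ in size and need no sample-level correspondence. In practice the student would be trained with an automatic-differentiation framework, and the kernel bandwidth and the weight `lam` would need tuning.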