Multi-Level Decoupled Relational Distillation for Heterogeneous Architectures

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing heterogeneous knowledge distillation methods inadequately exploit the "dark knowledge" embedded in teacher models, limiting cross-architecture transfer performance. To address this, the paper proposes the Multi-Level Decoupled Relational Knowledge Distillation (MLDR-KD) framework, introducing two novel mechanisms: Decoupled Fine-grained Relation Alignment (DFRA) and Multi-Scale Dynamic Fusion (MSDF). DFRA and MSDF jointly model structured relational knowledge at both the logit and feature levels, simultaneously enhancing dark-knowledge extraction and correct-class confidence. By integrating cross-architecture feature projection with a unified relational distillation paradigm, MLDR-KD enables efficient knowledge transfer across heterogeneous architectures, including CNNs, Transformers, MLPs, and Mambas. Extensive experiments on CIFAR-100 and Tiny-ImageNet demonstrate state-of-the-art performance, with absolute accuracy gains of up to 4.86% and 2.78% over the best prior methods, respectively. Moreover, MLDR-KD improves robustness and architectural generalizability.

📝 Abstract
Heterogeneous distillation is an effective way to transfer knowledge from cross-architecture teacher models to student models. However, existing heterogeneous distillation methods do not take full advantage of the dark knowledge hidden in the teacher's output, limiting their performance. To this end, we propose a novel framework named Multi-Level Decoupled Relational Knowledge Distillation (MLDR-KD) to unleash the potential of relational distillation in heterogeneous distillation. Concretely, we first introduce Decoupled Fine-grained Relation Alignment (DFRA) at both the logit and feature levels to balance the trade-off between distilled dark knowledge and the confidence in the correct category of the heterogeneous teacher model. Then, a Multi-Scale Dynamic Fusion (MSDF) module is applied to dynamically fuse the projected logits of multi-scale features at different stages of the student model, further improving the performance of our method at the feature level. We verify our method on four architectures (CNNs, Transformers, MLPs, and Mambas) and two datasets (CIFAR-100 and Tiny-ImageNet). Compared with the best available method, our MLDR-KD improves student model performance with gains of up to 4.86% on CIFAR-100 and 2.78% on Tiny-ImageNet, showing robustness and generality in heterogeneous distillation. Code will be released soon.
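The abstract does not spell out the DFRA loss. As a rough illustration of the "decoupled" idea it describes — separating confidence in the correct class from the dark knowledge carried by the non-target classes — here is a NumPy sketch in the spirit of decoupled logit distillation; the exact target/non-target split and the `alpha`, `beta`, `T` weights are assumptions, not the paper's formulation.

```python
import numpy as np

def softmax(x, t=1.0):
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / t)
    return z / z.sum(axis=-1, keepdims=True)

def decoupled_kd_loss(logits_s, logits_t, target, alpha=1.0, beta=2.0, T=4.0):
    """Illustrative decoupled logit distillation (not MLDR-KD's exact loss):
    splits the KL divergence into a target-class term (confidence in the
    correct category) and a non-target term (the 'dark knowledge')."""
    p_s = softmax(logits_s, T)
    p_t = softmax(logits_t, T)
    n, c = p_s.shape
    mask = np.zeros((n, c), dtype=bool)
    mask[np.arange(n), target] = True
    eps = 1e-12
    # binary distribution: [p(target), p(not target)]
    b_s = np.stack([p_s[mask], 1.0 - p_s[mask]], axis=1)
    b_t = np.stack([p_t[mask], 1.0 - p_t[mask]], axis=1)
    tckd = np.sum(b_t * (np.log(b_t + eps) - np.log(b_s + eps)), axis=1)
    # non-target distribution, renormalised over the remaining classes
    nt_s = p_s[~mask].reshape(n, c - 1)
    nt_s = nt_s / nt_s.sum(axis=1, keepdims=True)
    nt_t = p_t[~mask].reshape(n, c - 1)
    nt_t = nt_t / nt_t.sum(axis=1, keepdims=True)
    nckd = np.sum(nt_t * (np.log(nt_t + eps) - np.log(nt_s + eps)), axis=1)
    return float(np.mean(alpha * tckd + beta * nckd) * T * T)
```

Weighting the non-target term separately (`beta`) is what lets a method emphasise dark knowledge without flattening the teacher's confidence in the correct class — the trade-off the abstract says DFRA balances.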
Problem

Research questions and friction points this paper is trying to address.

Enhances knowledge transfer in heterogeneous architectures
Improves distillation via multi-level relational alignment
Boosts student model performance across datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Fine-grained Relation Alignment
Multi-Scale Dynamic Fusion
Heterogeneous Architecture Distillation
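The page does not detail how MSDF performs its dynamic fusion. A minimal sketch, assuming a per-sample softmax gate over the projected logits of each student stage (in practice the gate scores would come from a small learned head; both the gate and the function name here are hypothetical):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def fuse_multiscale_logits(stage_logits, gate_scores):
    """Hypothetical multi-scale dynamic fusion: per-sample softmax weights
    over S stages combine the projected logits of each student stage.

    stage_logits: list of S arrays, each (N, C)
    gate_scores:  (N, S) raw scores, e.g. from a learned gating head
    returns:      (N, C) fused logits, a convex combination per sample
    """
    w = softmax(gate_scores)                      # (N, S), rows sum to 1
    stacked = np.stack(stage_logits)              # (S, N, C)
    return np.einsum('ns,snc->nc', w, stacked)    # weighted sum over stages
```

Because the weights are a per-sample softmax, each fused logit stays within the range spanned by the stage logits, so fusion reweights scales rather than inventing new magnitudes.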
👥 Authors
Yaoxin Yang — Fudan University
Efficient Deep Learning · MLLM · Model Compression
Peng Ye — School of Information Science and Technology, Fudan University
Weihao Lin — PhD Student, Fudan University
Deep Learning · Computer Vision · Video Understanding · Model Compression
Kangcong Li — Fudan University
Yan Wen — Undergraduate, Fudan University
Machine Learning
Jia Hao — School of Information Science and Technology, Fudan University
Tao Chen — School of Information Science and Technology, Fudan University