AI Summary
To address the high computational cost and inference latency of deploying large language models (LLMs), this paper proposes a multi-teacher collaborative knowledge distillation framework. The method integrates probabilistic outputs and intermediate-layer semantic features from multiple heterogeneous teacher models. It combines a weighted output fusion mechanism, a feature alignment loss, and an entropy-driven dynamic teacher weighting strategy to make knowledge transfer both high-fidelity and stable. Compared with single-teacher distillation and static ensemble approaches, the framework improves the student model's consistency in language modeling, its generalization across text generation tasks, and its adaptability to diverse downstream tasks. It reduces perplexity by 12.3% and distillation loss by 18.7%, and it surpasses state-of-the-art distillation baselines in generation quality. The core contribution is the first scalable framework supporting dynamic, collaborative distillation from multiple heterogeneous knowledge sources.
Abstract
This paper addresses the challenges of high computational cost and slow inference in deploying large language models. It proposes a distillation strategy guided by multiple teacher models. The method constructs several teacher models and integrates their output probability distributions and intermediate semantic features. This guides the student model to learn from multiple sources of knowledge. As a result, the student model gains stronger language understanding and generation ability while maintaining a small parameter size. To achieve this, the paper introduces a weighted output fusion mechanism, a feature alignment loss function, and an entropy-driven dynamic teacher weighting strategy. These components improve the quality and stability of knowledge transfer during distillation. Under multi-teacher guidance, the student model captures semantic information more effectively and demonstrates strong performance across multiple evaluation metrics. In particular, the method shows high consistency in expression, generalization ability, and task adaptability in tasks such as language modeling, text generation, and multi-task learning. The experiments compare the proposed method with several widely adopted distillation approaches. The results further confirm its overall advantages in perplexity, distillation loss, and generation quality. This study provides a feasible technical path for the efficient compression of large-scale language models. It also demonstrates the effectiveness of multi-teacher collaborative mechanisms in complex language modeling tasks.
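The three components named in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: teacher weights are taken as a softmax over negative output entropy (more confident teachers count more), the fused target is a weighted average of teacher distributions, the output loss is a KL divergence, and feature alignment is a weighted mean squared error between features assumed to be already projected to a common dimension. All function names and the `temperature` and `feature_coeff` parameters are hypothetical and may differ from the paper's actual design.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def teacher_weights(teacher_probs):
    # Assumption: weight_k is proportional to exp(-H(p_k)), so a teacher
    # with a lower-entropy (more confident) output gets a larger weight.
    return softmax([-entropy(p) for p in teacher_probs])


def fuse_outputs(teacher_probs, weights):
    """Weighted average of the teacher distributions (assumed fusion rule)."""
    vocab = len(teacher_probs[0])
    return [sum(w * p[i] for w, p in zip(weights, teacher_probs))
            for i in range(vocab)]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), used here as the output distillation loss."""
    return sum(t * math.log((t + eps) / (q + eps))
               for t, q in zip(target, predicted) if t > 0.0)


def feature_alignment_loss(student_feat, teacher_feats, weights):
    # Assumption: features share one dimension, so a weighted mean squared
    # error can compare the student with each teacher directly.
    dim = len(student_feat)
    return sum(w * sum((s - t) ** 2 for s, t in zip(student_feat, feat)) / dim
               for w, feat in zip(weights, teacher_feats))


def distillation_loss(student_logits, teacher_logits, student_feat,
                      teacher_feats, temperature=2.0, feature_coeff=0.5):
    """Combine the output and feature terms for a single token position."""
    teacher_probs = [softmax(l, temperature) for l in teacher_logits]
    weights = teacher_weights(teacher_probs)
    fused = fuse_outputs(teacher_probs, weights)
    student_probs = softmax(student_logits, temperature)
    output_loss = kl_divergence(fused, student_probs)
    feat_loss = feature_alignment_loss(student_feat, teacher_feats, weights)
    return output_loss + feature_coeff * feat_loss, weights


# Toy example: teacher 0 is confident, teacher 1 is uniform (uninformative).
loss, weights = distillation_loss(
    student_logits=[2.0, 0.5, 0.1],
    teacher_logits=[[4.0, 1.0, 0.0], [1.0, 1.0, 1.0]],
    student_feat=[0.2, -0.1],
    teacher_feats=[[0.3, 0.0], [0.1, -0.4]],
)
print("teacher weights:", weights)
print("total loss:", loss)
```

In this toy setup the weights are recomputed for every input, which is one way to read "dynamic": a teacher that is uncertain on a given token contributes less to both the fused target and the feature term. A real implementation would operate on batched tensors and would need learned projections wherever teacher and student hidden sizes differ.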