Knowledge Distillation for Large Language Models

πŸ“… 2026-03-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of efficiently deploying large language models in resource-constrained environments by proposing a compression framework that integrates knowledge distillation, chain-of-thought prompting, and Group Relative Policy Optimization. The approach transfers capabilities from a Qwen 3B teacher model to a compact 0.5B-parameter student model. Through joint training on multilingual and code data, combined with 4-bit weight quantization, the student retains 70%–91% of the teacher's performance on English tasks, achieves up to 95% on Spanish tasks, and reaches a ROUGE-L score of 93.5% on code generation. The method substantially reduces memory consumption and inference latency while improving output coherence and code correctness.
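The 4-bit weight quantization mentioned above can be illustrated with a minimal sketch. The paper does not specify its quantization scheme; the symmetric per-tensor rounding below (`quantize_4bit`, `dequantize` are hypothetical helper names) is one common, simple variant: each weight is scaled into the signed 4-bit integer range [-8, 7] and stored with a single float scale.

```python
def quantize_4bit(weights):
    """Symmetric per-tensor 4-bit quantization.

    Maps each float weight to an integer in [-8, 7] plus one shared
    float scale, so storage drops from 32 bits to ~4 bits per weight.
    """
    scale = max(abs(w) for w in weights) / 7.0 or 1.0  # avoid zero scale
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from 4-bit codes."""
    return [qi * scale for qi in q]
```

Because the scheme is symmetric and per-tensor, the reconstruction error per weight is bounded by half the scale; practical deployments often use finer per-group scales to tighten that bound.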

πŸ“ Abstract
We propose a resource-efficient framework for compressing large language models through knowledge distillation combined with chain-of-thought-guided reinforcement learning. Using Qwen 3B as the teacher and Qwen 0.5B as the student, we apply knowledge distillation on the English and Spanish Dolly-15k datasets and the BugNet and PyTorrent code datasets, with hyperparameters tuned in the English setting to optimize student performance. Across tasks, the distilled student retains a substantial portion of the teacher's capability while remaining significantly smaller: 70% to 91% of teacher performance in English, up to 95% in Spanish, and up to 93.5% ROUGE-L in code generation. For coding tasks, integrating chain-of-thought prompting with Group Relative Policy Optimization on CoT-annotated Codeforces data improves reasoning coherence and solution correctness over knowledge distillation alone. Post-training 4-bit weight quantization further reduces memory footprint and inference latency. These results show that knowledge distillation combined with chain-of-thought-guided reinforcement learning can produce compact, efficient models suitable for deployment in resource-constrained settings.
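The core distillation step described in the abstract can be sketched as a temperature-softened KL divergence between teacher and student token distributions. The paper does not publish its exact loss; the sketch below assumes the classic Hinton-style formulation (`distillation_loss` and the temperature value are illustrative, not taken from the paper).

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    A higher temperature exposes the teacher's relative preferences
    among non-top tokens ("dark knowledge") to the student.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In training, this term would be summed over sequence positions and typically mixed with the standard cross-entropy loss on ground-truth labels; the loss is zero exactly when the student's softened distribution matches the teacher's.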
Problem

Research questions and friction points this paper is trying to address.

Knowledge Distillation
Large Language Models
Model Compression
Resource Efficiency
Chain-of-Thought
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge Distillation
Chain-of-Thought
Reinforcement Learning
Model Compression
Quantization