Knowledge Distillation for Large Language Models

πŸ“… 2026-03-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of efficiently deploying large language models in resource-constrained environments by proposing a compression framework that integrates knowledge distillation, chain-of-thought prompting, and Group Relative Policy Optimization. The approach transfers capabilities from a Qwen 3B teacher model to a compact 0.5B-parameter student model. Through joint training on multilingual and code data, combined with 4-bit weight quantization, the student retains 70%–91% of the teacher's performance on English tasks, achieves up to 95% on Spanish tasks, and reaches a ROUGE-L score of 93.5% on code generation. The method substantially reduces memory consumption and inference latency while improving output coherence and code correctness.
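The 4-bit weight quantization mentioned above can be illustrated with a minimal sketch. The paper does not specify its quantization scheme; the symmetric per-tensor rounding below (`quantize_4bit`, `dequantize` are hypothetical helper names) is one common, simple variant: each weight is scaled into the signed 4-bit integer range [-8, 7] and stored with a single float scale.

```python
def quantize_4bit(weights):
    """Symmetric per-tensor 4-bit quantization.

    Maps each float weight to an integer in [-8, 7] plus one shared
    float scale, so storage drops from 32 bits to ~4 bits per weight.
    """
    scale = max(abs(w) for w in weights) / 7.0 or 1.0  # avoid zero scale
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from 4-bit codes."""
    return [qi * scale for qi in q]
```

Because the scheme is symmetric and per-tensor, the reconstruction error per weight is bounded by half the scale; practical deployments often use finer per-group scales to tighten that bound.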

πŸ“ Abstract
We propose a resource-efficient framework for compressing large language models through knowledge distillation combined with chain-of-thought-guided reinforcement learning. Using Qwen 3B as the teacher and Qwen 0.5B as the student, we apply knowledge distillation on the English and Spanish Dolly-15k datasets and the BugNet and PyTorrent code datasets, with hyperparameters tuned in the English setting to optimize student performance. Across tasks, the distilled student retains a substantial portion of the teacher's capability while remaining significantly smaller: 70% to 91% of teacher performance in English, up to 95% in Spanish, and up to 93.5% ROUGE-L in code generation. For coding tasks, integrating chain-of-thought prompting with Group Relative Policy Optimization on CoT-annotated Codeforces data improves reasoning coherence and solution correctness over knowledge distillation alone. Post-training 4-bit weight quantization further reduces memory footprint and inference latency. These results show that knowledge distillation combined with chain-of-thought-guided reinforcement learning can produce compact, efficient models suitable for deployment in resource-constrained settings.
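The core distillation step described in the abstract can be sketched as a temperature-softened KL divergence between teacher and student token distributions. The paper does not publish its exact loss; the sketch below assumes the classic Hinton-style formulation (`distillation_loss` and the temperature value are illustrative, not taken from the paper).

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    A higher temperature exposes the teacher's relative preferences
    among non-top tokens ("dark knowledge") to the student.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In training, this term would be summed over sequence positions and typically mixed with the standard cross-entropy loss on ground-truth labels; the loss is zero exactly when the student's softened distribution matches the teacher's.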
Problem

Research questions and friction points this paper is trying to address.

Knowledge Distillation
Large Language Models
Model Compression
Resource Efficiency
Chain-of-Thought
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge Distillation
Chain-of-Thought
Reinforcement Learning
Model Compression
Quantization