Revealing the Power of Post-Training for Small Language Models via Knowledge Distillation

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the suboptimal performance of small language models (SLMs) on resource-constrained edge devices, this paper proposes a systematic post-training framework integrating curriculum-based supervised fine-tuning (Curriculum SFT) and offline on-policy knowledge distillation. Notably, it is the first work to incorporate curriculum learning into the distillation pipeline, significantly enhancing the reasoning capability and task generalization of billion-parameter SLMs under stringent hardware constraints. Leveraging Ascend-based edge-platform optimizations, the resulting instruction-tuned model achieves state-of-the-art (SOTA) performance across diverse complex tasks, matching the accuracy of large language models while reducing inference latency by over 60% and cutting hardware resource consumption by an order of magnitude. This work establishes a cost-effective and generalizable paradigm for deploying high-performance SLMs in edge AI scenarios.
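The paper's code is not reproduced here, but the objective such distillation pipelines typically optimize is easy to sketch. Below is a minimal, hypothetical PyTorch illustration of token-level knowledge distillation, where the student's softened next-token distribution is pulled toward the teacher's via KL divergence; the temperature value, the KL direction, and all names are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Token-level distillation: KL between softened next-token
    distributions. Expected shapes: (batch, seq_len, vocab_size).
    Temperature and KL direction are common defaults, not values
    taken from the paper."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean KL, rescaled by T^2 so gradient scale stays comparable.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

# Toy usage: random logits stand in for real model outputs.
student_logits = torch.randn(2, 8, 32000)
teacher_logits = torch.randn(2, 8, 32000)
print(kd_loss(student_logits, teacher_logits))
```

The T² rescaling is a standard choice going back to the original distillation formulation; it keeps gradient magnitudes comparable across temperature settings.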

📝 Abstract
The rapid advancement of large language models (LLMs) has greatly expanded the capabilities of artificial intelligence across various domains. However, their massive scale and high computational costs render them unsuitable for direct deployment in resource-constrained edge environments. This creates a critical need for high-performance small models that can operate efficiently at the edge. Yet, after pre-training alone, these smaller models often fail to meet the performance requirements of complex tasks. To bridge this gap, we introduce a systematic post-training pipeline that efficiently enhances small model accuracy. Our post-training pipeline consists of curriculum-based supervised fine-tuning (SFT) and offline on-policy knowledge distillation. The resulting instruction-tuned model achieves state-of-the-art performance among billion-parameter models, demonstrating strong generalization under strict hardware constraints while maintaining competitive accuracy across a variety of tasks. This work provides a practical and efficient solution for developing high-performance language models on Ascend edge devices.
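As a rough illustration of what curriculum-based SFT can mean in practice (the abstract does not specify the paper's difficulty measure or schedule), the sketch below orders fine-tuning examples by a simple difficulty proxy and presents them easy-first; the length-based proxy and all names are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class SFTExample:
    prompt: str
    response: str

def difficulty(ex: SFTExample) -> float:
    # Illustrative proxy only: longer target responses count as harder.
    # Real curricula often score examples by teacher loss or perplexity.
    return float(len(ex.response.split()))

def curriculum_order(dataset: list[SFTExample]) -> list[SFTExample]:
    """Sort examples easy-to-hard so early SFT steps see simple data."""
    return sorted(dataset, key=difficulty)

data = [
    SFTExample("Explain knowledge distillation.",
               "It trains a small student to match a large teacher's outputs."),
    SFTExample("2 + 2 = ?", "4"),
]
for ex in curriculum_order(data):
    print(f"{difficulty(ex):4.0f}  {ex.prompt}")
```

In production pipelines the ordering is usually applied per training stage or batch schedule rather than as a single global sort, but the easy-to-hard principle is the same.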
Problem

Research questions and friction points this paper is trying to address.

Enhancing small model accuracy via knowledge distillation
Bridging performance gap for resource-constrained edge devices
Developing efficient language models for hardware-limited environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum-based supervised fine-tuning for small models
Offline on-policy knowledge distillation technique (sketched below)
Systematic post-training pipeline for edge devices
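One common reading of "offline on-policy" distillation, sketched below, is a two-stage recipe: the student first samples responses under its own policy (on-policy), the teacher's logits on those samples are cached, and distillation then runs offline against the frozen buffer with the same KL objective as the earlier sketch. The tiny stand-in models and every name here are hypothetical; the paper's actual sampling, filtering, and curriculum integration are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 100  # toy vocabulary; real tokenizers have ~32k+ entries

class TinyLM(nn.Module):
    """Hypothetical stand-in LM: embedding plus linear head."""
    def __init__(self, dim=32):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, dim)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):               # tokens: (batch, seq)
        return self.head(self.emb(tokens))   # logits: (batch, seq, VOCAB)

student, teacher = TinyLM(dim=32), TinyLM(dim=64)

@torch.no_grad()
def build_buffer(prompts, steps=8):
    """Stage 1 (offline, on-policy): sample continuations FROM THE STUDENT,
    then cache the teacher's logits on those same tokens."""
    buffer = []
    for tokens in prompts:
        for _ in range(steps):  # the student's own policy generates the data
            probs = F.softmax(student(tokens)[:, -1], dim=-1)
            tokens = torch.cat([tokens, torch.multinomial(probs, 1)], dim=1)
        buffer.append((tokens, teacher(tokens)))
    return buffer

# Stage 2 (offline training): distill against the cached teacher logits;
# the teacher is never queried again during optimization.
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for tokens, t_logits in build_buffer([torch.randint(VOCAB, (1, 4))]):
    log_p = F.log_softmax(student(tokens), dim=-1)
    q = F.softmax(t_logits, dim=-1)
    loss = F.kl_div(log_p, q, reduction="batchmean")  # temperature omitted
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"distill loss: {loss.item():.4f}")
```

Training on the student's own samples (rather than fixed teacher demonstrations) reduces the train/inference distribution mismatch, while caching the teacher's outputs keeps the expensive teacher out of the optimization loop.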
👥 Authors
Miao Rang · Huawei Technologies Co., Ltd. · computer vision
Zhenni Bi · Huawei Noah's Ark Lab
Hang Zhou · Huawei Noah's Ark Lab
Hanting Chen · Noah's Ark Lab, Huawei · deep learning, machine learning, computer vision
An Xiao · Huawei Noah's Ark Lab
Tianyu Guo · Huawei Noah's Ark Lab
Kai Han · Huawei Noah's Ark Lab
Xinghao Chen · Huawei Noah's Ark Lab
Yunhe Wang · Noah's Ark Lab, Huawei Technologies · Deep Learning, Language Model, Machine Learning, Computer Vision