A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address three key challenges in small language model (SLM) training, namely information loss from hard pruning, inefficient representation alignment, and underutilization of feed-forward network (FFN) activations, this paper proposes Low-Rank Clone (LRC). LRC jointly learns projection matrices within a unified low-rank framework, enabling soft pruning of teacher weights and end-to-end alignment of full-layer student activations, including the critical FFN signals, without auxiliary alignment modules. Crucially, LRC is the first method to unify soft pruning and activation cloning, thereby substantially improving how much of the teacher's FFN information the student exploits. Experiments with Llama-3.2 and Qwen2.5 teachers demonstrate that LRC matches or surpasses state-of-the-art models trained on trillion-token corpora while using only 20 billion tokens, an over 1,000x gain in training efficiency. The code and pretrained checkpoints are publicly released.
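The soft-pruning idea in the summary can be illustrated with a minimal PyTorch sketch. This is a hypothetical toy, not the paper's implementation: the dimensions, variable names (`W_teacher`, `P_in`, `P_out`), and the exact projection form are assumptions. The point is that the student weight is a *learned* low-rank compression of the frozen teacher weight, rather than a hard-pruned subset of its rows and columns.

```python
import torch

# Hypothetical dimensions: teacher hidden size 3072, student hidden size 1024.
d_teacher, d_student = 3072, 1024

# Frozen teacher weight for one linear layer (e.g. an attention projection).
W_teacher = torch.randn(d_teacher, d_teacher)

# Learned low-rank projection matrices: the only trainable parameters here.
# P_in compresses the layer's input dimension, P_out its output dimension.
P_in = torch.randn(d_teacher, d_student, requires_grad=True)
P_out = torch.randn(d_teacher, d_student, requires_grad=True)

# "Soft pruning": the student weight is a differentiable compression of the
# teacher weight, so gradients flow into the projections during training.
W_student = P_out.T @ W_teacher @ P_in  # shape: (d_student, d_student)

assert W_student.shape == (d_student, d_student)
```

Because `W_student` is built from the full teacher matrix, no teacher information is discarded up front; the projections decide what to keep as training proceeds.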

📝 Abstract
Training high-performing Small Language Models (SLMs) remains costly, even with knowledge distillation and pruning from larger teacher models. Existing work often faces three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers (e.g., Llama-3.2-3B-Instruct, Qwen2.5-3B/7B-Instruct) show that LRC matches or surpasses state-of-the-art models trained on trillions of tokens, while using only 20B tokens, achieving over 1,000x training efficiency. Our code and model checkpoints are available at https://github.com/CURRENTF/LowRankClone and https://huggingface.co/collections/JitaiHao/low-rank-clone-lrc-6828389e96a93f1d4219dfaf.
Problem

Research questions and friction points this paper is trying to address.

High cost of training Small Language Models (SLMs) efficiently
Information loss and inefficiency in existing knowledge distillation methods
Underutilization of informative activations from teacher models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-Rank Clone (LRC) for efficient pre-training
Soft pruning via low-rank projection matrices
Activation cloning that aligns student activations, including FFN signals, with the teacher's
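The activation-alignment bullet above can be sketched as a simple training objective. This is a hedged illustration under assumptions: the loss form (MSE), the shared projection `P`, and the tensor shapes are hypothetical, chosen only to show how FFN activations can be aligned without a separate trained alignment module.

```python
import torch
import torch.nn.functional as F

# Hypothetical activations for one transformer layer on a batch of tokens.
batch, seq, d_teacher, d_student = 2, 16, 3072, 1024
teacher_ffn_act = torch.randn(batch, seq, d_teacher)
student_ffn_act = torch.randn(batch, seq, d_student)

# A low-rank projection maps teacher activations into the student's space;
# no auxiliary alignment network is introduced beyond this matrix.
P = torch.randn(d_teacher, d_student, requires_grad=True)

# Activation-clone loss: pull student activations (here, FFN outputs)
# toward the projected teacher activations.
clone_loss = F.mse_loss(student_ffn_act, teacher_ffn_act @ P)
clone_loss.backward()
assert P.grad is not None
```

In a full training loop this term would be summed over layers and activation sites and combined with the usual distillation objective; the sketch shows only the per-layer alignment step.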