🤖 AI Summary
To address three key challenges in small language model (SLM) training (information loss from hard pruning, inefficient representation alignment, and underutilization of feed-forward network (FFN) activations), this paper proposes Low-Rank Clone (LRC). LRC trains low-rank projection matrices within a unified framework that jointly perform soft pruning, by compressing teacher weights, and activation cloning, by aligning full student activations, including the informative FFN signals, with the teacher's, all without auxiliary alignment modules. By unifying soft pruning and activation cloning, LRC makes substantially better use of FFN information than prior distillation-and-pruning pipelines. Experiments with Llama-3.2 and Qwen2.5 teachers show that LRC matches or surpasses state-of-the-art models trained on trillion-token corpora while using only 20 billion tokens, an over 1,000× gain in training efficiency. The code and pretrained checkpoints are publicly released.
📝 Abstract
Training high-performing Small Language Models (SLMs) remains costly, even with knowledge distillation and pruning from larger teacher models. Existing work often faces three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning, by compressing teacher weights, and activation clone, by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers (e.g., Llama-3.2-3B-Instruct, Qwen2.5-3B/7B-Instruct) show that LRC matches or surpasses state-of-the-art models trained on trillions of tokens--while using only 20B tokens, achieving over 1,000x greater training efficiency. Our code and model checkpoints are available at https://github.com/CURRENTF/LowRankClone and https://huggingface.co/collections/JitaiHao/low-rank-clone-lrc-6828389e96a93f1d4219dfaf.
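The core idea above (one set of low-rank projections doing double duty: compressing teacher weights into student weights, and defining the target for activation alignment) can be illustrated with a minimal NumPy sketch. This is a toy assumption-laden illustration, not the paper's implementation: it assumes a single shared projection `P` for one linear layer, whereas LRC's actual parameterization (separate projections per weight type, training objective, and optimization details) may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_teacher, d_student, n_tokens = 8, 4, 16  # toy dimensions

# One teacher weight matrix (e.g., an FFN projection).
W_t = rng.normal(size=(d_teacher, d_teacher))

# Hypothetical learned low-rank projection P (d_student x d_teacher).
P = rng.normal(size=(d_student, d_teacher)) / np.sqrt(d_teacher)

# "Soft pruning": the student weight is a low-rank compression of the
# teacher weight, so no teacher rows/columns are discarded outright.
W_s = P @ W_t @ P.T                      # shape (d_student, d_student)

# "Activation clone": student outputs are trained to match the
# down-projected teacher outputs (here just measured, not optimized).
h_t = rng.normal(size=(n_tokens, d_teacher))  # teacher-layer inputs
y_t = h_t @ W_t.T                             # teacher activations
y_s = (h_t @ P.T) @ W_s.T                     # student activations
clone_loss = np.mean((y_s - y_t @ P.T) ** 2)  # MSE alignment term
```

Because `P` appears both in the weight compression and in the alignment target, gradients on the clone loss would update the same projection that defines the student weights, which is the sense in which the two objectives are unified.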