🤖 AI Summary
This study addresses an oversight in existing large language model distillation research: comparisons between forward and reverse KL divergence typically ignore temperature, which biases their conclusions. The analysis reveals an asymmetry in how temperature affects the two objectives: higher temperatures substantially amplify signals from non-dominant tokens in forward KL, whereas they mainly rescale gradients in reverse KL. Theoretical analysis and experiments on instruction-following benchmarks show that appropriately raising the temperature enables forward KL distillation to consistently outperform reverse KL. Moreover, a simple KL-based approach with a calibrated temperature rivals state-of-the-art distillation methods, challenging the prevailing empirical belief that reverse KL is inherently superior.
📝 Abstract
Reverse Kullback-Leibler (RKL) divergence is widely favored over forward KL (FKL) in large language model (LLM) distillation, yet this preference is largely based on comparisons that omit the temperature $\tau$, overlooking its central role in softening teacher distributions and improving knowledge transfer. In this work, we revisit temperature in LLM distillation and show that it fundamentally changes the comparison between FKL and RKL. Our analysis reveals an asymmetric effect: temperature substantially enriches FKL with non-dominant token signals, whereas it mainly rescales RKL gradients, so FKL benefits much more from $\tau$ scaling than RKL does. This asymmetry overturns the standard empirical conclusion: although RKL outperforms FKL at $\tau=1$, FKL consistently surpasses RKL at higher temperatures across instruction-following benchmarks. Moreover, the impact of temperature is not limited to FKL; it improves a broader family of distillation objectives, enabling simple KL-based methods to achieve competitive performance against recent state-of-the-art LLM distillation approaches.
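As a rough illustration of the two objectives compared above, the sketch below computes forward and reverse KL between temperature-softened teacher and student distributions for a single token position. It assumes the common distillation convention in which the same temperature divides both teacher and student logits; the abstract does not spell out this detail, and the function names and example logits are hypothetical, not taken from the paper.

```python
import numpy as np

def log_softmax(logits, tau=1.0):
    """Log-probabilities of softmax(logits / tau), computed stably."""
    z = np.asarray(logits, dtype=np.float64) / tau
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def forward_kl(teacher_logits, student_logits, tau=1.0):
    """FKL: KL(teacher || student), expectation taken under the teacher."""
    log_p = log_softmax(teacher_logits, tau)  # teacher
    log_q = log_softmax(student_logits, tau)  # student
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

def reverse_kl(teacher_logits, student_logits, tau=1.0):
    """RKL: KL(student || teacher), expectation taken under the student."""
    log_p = log_softmax(teacher_logits, tau)
    log_q = log_softmax(student_logits, tau)
    return float((np.exp(log_q) * (log_q - log_p)).sum(axis=-1).mean())

# Hypothetical logits over a 4-token vocabulary (illustrative values only).
teacher = np.array([4.0, 1.0, 0.5, 0.0])
student = np.array([2.0, 2.0, 0.0, 0.0])

for tau in (1.0, 2.0, 4.0):
    top_prob = np.exp(log_softmax(teacher, tau)).max()
    print(f"tau={tau}: teacher top-token prob={top_prob:.3f}, "
          f"FKL={forward_kl(teacher, student, tau):.4f}, "
          f"RKL={reverse_kl(teacher, student, tau):.4f}")
```

Raising `tau` flattens the teacher distribution, so non-dominant tokens carry more of the probability mass that FKL weights its loss by. This sketch only evaluates the losses; it does not reproduce the gradient analysis the abstract describes.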