🤖 AI Summary
Existing knowledge distillation approaches overemphasize hard negative samples while neglecting the overall distribution of teacher model output scores, which limits the student model's ability to learn the teacher's complete preference structure and impairs its generalization. This work is the first to systematically highlight the importance of preserving the teacher's score distribution, and it proposes a stratified sampling strategy that uniformly covers the full range of teacher-assigned scores during training. By maintaining the variance and entropy of the teacher's outputs, the method transfers preference information more comprehensively than conventional approaches that rely solely on hard negatives. It achieves significant gains over Top-K and random sampling baselines on both in-domain and out-of-domain dense retrieval benchmarks, demonstrating that sustaining the diversity of teacher scores is crucial for student model performance.
📝 Abstract
Transferring knowledge from a cross-encoder teacher via Knowledge Distillation (KD) has become a standard paradigm for training retrieval models. While existing studies have largely focused on mining hard negatives to improve discrimination, the systematic composition of training data and the resulting teacher score distribution have received comparatively little attention. In this work, we highlight that focusing solely on hard negatives prevents the student from learning the comprehensive preference structure of the teacher, potentially hampering generalization. To effectively emulate the teacher score distribution, we propose a Stratified Sampling strategy that uniformly covers the entire score spectrum. Experiments on in-domain and out-of-domain benchmarks confirm that Stratified Sampling, which preserves the variance and entropy of teacher scores, serves as a robust baseline, significantly outperforming Top-K and random sampling in diverse settings. These findings suggest that the essence of distillation lies in preserving the diverse range of relative scores perceived by the teacher.
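To make the idea concrete, here is a minimal sketch of stratified sampling over teacher scores. The paper does not specify an implementation; this hypothetical helper (`stratified_sample`, with assumed parameters `n_samples` and `n_strata`) simply bins candidates into equal-width score strata and draws from each bin, so the selected negatives span the full score spectrum rather than clustering at the hard-negative end as Top-K would.

```python
import random

def stratified_sample(teacher_scores, n_samples, n_strata=8, seed=0):
    """Pick candidate indices so the full teacher-score range is covered.

    teacher_scores: per-candidate relevance scores from the teacher.
    Hypothetical illustration of stratified sampling; equal-width bins
    and equal per-bin quotas are assumptions, not the paper's exact recipe.
    """
    rng = random.Random(seed)
    lo, hi = min(teacher_scores), max(teacher_scores)
    width = (hi - lo) / n_strata or 1.0  # guard against all-equal scores
    # Assign each candidate to a score stratum (equal-width bins).
    strata = [[] for _ in range(n_strata)]
    for idx, score in enumerate(teacher_scores):
        b = min(int((score - lo) / width), n_strata - 1)
        strata[b].append(idx)
    # Draw a roughly equal number of candidates from each non-empty stratum.
    nonempty = [bucket for bucket in strata if bucket]
    per_stratum = max(1, n_samples // len(nonempty))
    picked = []
    for bucket in nonempty:
        picked.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return picked[:n_samples]
```

In contrast, a Top-K baseline would take only the K highest-scoring negatives, collapsing the score distribution the student sees; the stratified draw above keeps low-, mid-, and high-scoring candidates in every batch.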