Direct Preference Knowledge Distillation for Large Language Models

📅 2024-06-28

🏛️ arXiv.org

📈 Citations: 5

✨ Influential: 1

career value

187K/year

🤖 AI Summary

To address the weak modeling capability and low computational efficiency of Kullback–Leibler (KL) divergence in large language model (LLM) knowledge distillation, this paper proposes Direct Preference Knowledge Distillation (DPKD). DPKD introduces a novel implicit reward-guided distillation paradigm: it explicitly models the teacher’s implicit reward signal and output preference probabilities as the distillation objective, integrates reverse KL divergence into a distributional preference loss, and establishes a two-stage optimization framework. We provide theoretical guarantees on its convergence and effectiveness. Experiments across model scales from 120M to 13B parameters demonstrate that DPKD significantly improves response accuracy and exact-match rates, consistently outperforming state-of-the-art baselines. Moreover, it exhibits strong generalization across diverse model sizes.

Technology Category

Application Category

📝 Abstract

In the field of large language models (LLMs), Knowledge Distillation (KD) is a critical technique for transferring capabilities from teacher models to student models. However, existing KD methods face limitations and challenges in distillation of LLMs, including efficiency and insufficient measurement capabilities of traditional KL divergence. It is shown that LLMs can serve as an implicit reward function, which we define as a supplement to KL divergence. In this work, we propose Direct Preference Knowledge Distillation (DPKD) for LLMs. DPKD utilizes distribution divergence to represent the preference loss and implicit reward function. We re-formulate KD of LLMs into two stages: first optimizing and objective consisting of implicit reward and reverse KL divergence and then improving the preference probability of teacher outputs over student outputs. We conducted experiments and analysis on various datasets with LLM parameters ranging from 120M to 13B and demonstrate the broad applicability and effectiveness of our DPKD approach. Meanwhile, we prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis. The DPKD method outperforms the baseline method in both output response precision and exact match percentage. Code and data are available at https://aka.ms/dpkd.

Problem

Research questions and friction points this paper is trying to address.

Improving efficiency in distilling knowledge from large language models

Addressing limitations of traditional KL divergence in distillation

Enhancing student model performance via preference-based distillation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses implicit reward as KL divergence supplement

Formulates KD into two optimization stages

Improves preference probability via distribution divergence

🔎 Similar Papers

No similar papers found.