BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation

📅 2024-06-19
🏛️ International Conference on Computational Linguistics
📈 Citations: 1
Influential: 1
🤖 AI Summary
Knowledge distillation for large language models (LLMs) suffers from severely long-tailed logit distributions, high noise sensitivity, and underutilization of the intrinsic ordinal relationships among logits. Method: This work first identifies the extreme long-tail characteristic and latent noise in the logits of fine-tuned LLMs, then proposes the Bi-directional Logits Difference (BiLD) loss. BiLD constructs bi-directional logit differences exclusively over the top-k logits, jointly suppressing long-tail noise and explicitly modeling ordinal structure, thereby avoiding conventional KL divergence's reliance on the full logit distribution. Contribution/Results: Evaluated across 13 NLP benchmarks, BiLD using only the top-8 logits consistently outperforms supervised fine-tuning, vanilla KL loss, and five other state-of-the-art distillation methods from both the NLP and CV fields, yielding substantial gains in small-model accuracy and generalization. This offers an efficient, robust recipe for compressing and deploying lightweight LLMs.

📝 Abstract
In recent years, large language models (LLMs) have shown exceptional capabilities across various natural language processing (NLP) tasks. However, such impressive performance often comes with the trade-off of an increased parameter size, posing significant challenges for widespread deployment. Knowledge distillation (KD) provides a solution by transferring knowledge from a large teacher model to a smaller student model. In this paper, we explore the task-specific distillation of LLMs at the logit level. Our investigation reveals that the logits of fine-tuned LLMs exhibit a more extreme long-tail distribution than those from vision models, with hidden "noise" in the long tail affecting distillation performance. Furthermore, existing logits distillation methods often struggle to effectively utilize the internal ranking information from the logits. To address these, we propose the Bi-directional Logits Difference (BiLD) loss. The BiLD loss filters out the long-tail noise by utilizing only top-k teacher and student logits, and leverages the internal logits ranking information by constructing logits differences. To evaluate BiLD loss, we conduct comprehensive experiments on 13 datasets using two types of LLMs. Our results show that the BiLD loss, with only the top-8 logits, outperforms supervised fine-tuning (SFT), vanilla KL loss, and five other distillation methods from both NLP and CV fields.
Problem

Research questions and friction points this paper is trying to address.

Addresses long-tail noise in LLM logits
Enhances logit ranking information utilization
Improves knowledge distillation for LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bi-directional Logits Difference Loss
Top-k logits utilization
Internal ranking information leverage
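The core idea above can be sketched in code. This is a minimal, hedged sketch, not the paper's reference implementation: it assumes the loss takes a KL divergence between softmax-normalized pairwise differences of top-k logits, computed once with indices chosen by the teacher and once with indices chosen by the student. The function names (`bild_loss`, `logit_differences`) and the exact normalization are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def logit_differences(logits, idx):
    # pairwise differences l_i - l_j over the selected indices (i < j);
    # these differences encode the internal ranking of the top-k logits
    sel = logits[idx]
    k = len(sel)
    return np.array([sel[i] - sel[j] for i in range(k) for j in range(i + 1, k)])

def kl(p, q, eps=1e-12):
    # KL(p || q) with a small epsilon for numerical safety
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def bild_loss(teacher_logits, student_logits, k=8):
    """Sketch of a Bi-directional Logits Difference loss.

    Direction 1 uses the teacher's top-k indices; direction 2 uses the
    student's top-k indices. Long-tail logits outside the top-k are
    ignored entirely, which is what filters out the tail noise.
    """
    t_idx = np.argsort(teacher_logits)[-k:][::-1]
    s_idx = np.argsort(student_logits)[-k:][::-1]
    # teacher-led direction: compare difference distributions at teacher's top-k
    loss_t2s = kl(softmax(logit_differences(teacher_logits, t_idx)),
                  softmax(logit_differences(student_logits, t_idx)))
    # student-led direction: compare difference distributions at student's top-k
    loss_s2t = kl(softmax(logit_differences(teacher_logits, s_idx)),
                  softmax(logit_differences(student_logits, s_idx)))
    return loss_t2s + loss_s2t

# identical teacher and student logits give zero loss
t = np.array([5.0, 3.0, 1.0, 0.5, -2.0, -3.0])
print(round(bild_loss(t, t.copy(), k=4), 6))  # → 0.0
```

Because only top-k (e.g. k=8) logits enter the loss, the vocabulary-sized tail never contributes gradient, and the pairwise differences make the student match the teacher's ordering rather than just its absolute probabilities.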
Minchong Li
KTH Royal Institute of Technology, Stockholm, Sweden; OPPO AI Center, Beijing, China
Feng Zhou
OPPO AI Center, Beijing, China
Xiaohui Song
OPPO AI Center, Beijing, China