🤖 AI Summary
Extending the context window of large language models often degrades short-text task performance due to hidden-state distribution shift and catastrophic forgetting induced by continued pretraining. To address this, we propose a restorative distillation framework featuring (i) hidden-state distillation on short texts and (ii) short-to-long output distribution alignment. Our method jointly optimizes these objectives with skipped positional indices, distribution consistency constraints, and lightweight continued pretraining. Empirically, it preserves or even surpasses baseline long-context understanding while significantly mitigating performance degradation on short-text tasks. Across multiple text benchmarks, our approach improves both short- and long-context task performance. This work offers a path toward balancing generalization across context lengths in context-extended models.
📝 Abstract
Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, and the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to the issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation by minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden states of selected layers from the original model on short texts. Additionally, LongReD introduces a short-to-long distillation, aligning the output distribution on short texts with that on long texts by leveraging skipped positional indices. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance while maintaining comparable or even better long-text capability than baselines.
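To make the combined objective concrete, here is a minimal NumPy sketch of how the three terms described above might be composed: a standard language-modeling loss on long texts, a hidden-state distillation loss against the original model on short texts, and a KL-based short-to-long output alignment term. The function and argument names (`longred_loss`, `alpha`, `beta`) and the loss weights are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def hidden_state_distill_loss(h_extended, h_original):
    # MSE between the extended model's and the original model's
    # hidden states at selected layers, on short texts
    return float(np.mean((h_extended - h_original) ** 2))

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q) between two output probability distributions;
    # eps avoids log(0) for numerical stability
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def longred_loss(lm_loss_long, h_extended, h_original,
                 probs_short, probs_skipped, alpha=1.0, beta=1.0):
    """Illustrative combined objective (alpha/beta weights are assumptions):
    long-text LM loss
    + alpha * short-text hidden-state distillation
    + beta  * short-to-long alignment, where probs_skipped is the output
      distribution obtained with skipped positional indices."""
    l_distill = hidden_state_distill_loss(h_extended, h_original)
    l_align = kl_divergence(probs_short, probs_skipped)
    return lm_loss_long + alpha * l_distill + beta * l_align
```

In this sketch, the distillation term vanishes when the extended model's short-text hidden states match the original's, and the alignment term vanishes when the short- and skipped-index output distributions agree, so the objective reduces to the ordinary long-text LM loss in the ideal case.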