🤖 AI Summary
Extending the context window of large language models often degrades short-text task performance due to hidden-state distribution shift and catastrophic forgetting induced by continued pretraining. To address this, we propose a restorative distillation framework featuring (i) hidden-state distillation on short texts and (ii) short-to-long output distribution alignment. Our method jointly optimizes these objectives with skipped positional indices, distribution consistency constraints, and lightweight continued pretraining. Empirically, it preserves or even surpasses baseline long-context understanding while significantly mitigating performance degradation on short-text tasks. Across multiple text benchmarks, our approach improves both short- and long-context task performance. This work offers a path toward balancing generalization across context lengths in context-extended models.
📝 Abstract
Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, and the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to the issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation by minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden states of selected layers from the original model on short texts. Additionally, LongReD introduces a short-to-long distillation, aligning the output distribution on short texts with that on long texts by leveraging skipped positional indices. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance while maintaining comparable or even better long-text capability than baselines.
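To make the combined objective concrete, here is a minimal NumPy sketch of how the three terms described above might be composed: a standard language-modeling loss on long texts, a hidden-state distillation loss against the original model on short texts, and a KL-based short-to-long output alignment term. The function and argument names (`longred_loss`, `alpha`, `beta`) and the loss weights are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def hidden_state_distill_loss(h_extended, h_original):
    # MSE between the extended model's and the original model's
    # hidden states at selected layers, on short texts
    return float(np.mean((h_extended - h_original) ** 2))

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q) between two output probability distributions;
    # eps avoids log(0) for numerical stability
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def longred_loss(lm_loss_long, h_extended, h_original,
                 probs_short, probs_skipped, alpha=1.0, beta=1.0):
    """Illustrative combined objective (alpha/beta weights are assumptions):
    long-text LM loss
    + alpha * short-text hidden-state distillation
    + beta  * short-to-long alignment, where probs_skipped is the output
      distribution obtained with skipped positional indices."""
    l_distill = hidden_state_distill_loss(h_extended, h_original)
    l_align = kl_divergence(probs_short, probs_skipped)
    return lm_loss_long + alpha * l_distill + beta * l_align
```

In this sketch, the distillation term vanishes when the extended model's short-text hidden states match the original's, and the alignment term vanishes when the short- and skipped-index output distributions agree, so the objective reduces to the ordinary long-text LM loss in the ideal case.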