🤖 AI Summary
Causal language models condition only on preceding context, so future tokens that are available in the training data go unused during training. This work proposes the Regret Pre-training framework, which incorporates future context as privileged information during pre-training through a dual-view architecture: a student view performs standard causal modeling, while a teacher view uses future tokens to produce a future-conditioned distribution. A regret loss, defined as the KL divergence from the teacher view to the student view, injects future-aware signals into the causal representations. The approach is self-supervised, adds no model parameters, and obtains the teacher view by extending the attention pattern. After training on 4 billion tokens, GlobalRegret and LocalRegret reach average accuracies of 33.9% and 32.2%, respectively, across nine downstream tasks, compared with 30.2% for the baseline; notably, GlobalRegret improves BoolQ accuracy by 18.1 percentage points.
📝 Abstract
Causal language models factorize sequence probabilities using only preceding context, leaving future information unexploited during training despite its availability in the training data. This paper introduces Regret Pre-training, a self-supervised framework grounded in the Learning Using Privileged Information (LUPI) paradigm. The framework employs a dual-view architecture in which a single model generates both a causal Student distribution and a future-conditioned Teacher distribution. The training objective augments standard language modeling with a regret loss that minimizes the KL divergence from teacher to student, transferring future-aware signals to the causal representations. We investigate two teacher configurations on the OLMoE-1B-7B architecture: LocalRegret, which extends attention by one future token, and GlobalRegret, which conditions on bidirectional context with the target position masked. Experiments on nine downstream tasks following 4 billion tokens of training demonstrate that both configurations consistently outperform the baseline. On average, GlobalRegret and LocalRegret achieve 33.9% and 32.2% accuracy respectively, surpassing the baseline's 30.2%. Most notably, GlobalRegret improves BoolQ performance by 18.1 percentage points (61.0% vs 42.9%). The framework introduces no additional parameters and requires only one extra inference-mode forward pass per training step.
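The training objective described above can be sketched for a single token position. This is a minimal illustration and not the authors' implementation: the function names, the `regret_weight` coefficient, and the reading of "KL divergence from teacher to student" as KL(teacher || student) are all assumptions, and the toy logits are made up. The teacher logits are treated as constants, which mirrors the extra inference-mode forward pass mentioned in the abstract.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

def regret_objective(student_logits, teacher_logits, target_id, regret_weight=1.0):
    """Language-modeling loss plus a weighted regret term at one position.

    Assumptions (not from the source): the regret term is KL(teacher || student),
    and it is combined with the LM loss through a scalar `regret_weight`.
    """
    student = softmax(student_logits)   # causal view: past context only
    teacher = softmax(teacher_logits)   # privileged view: also sees future tokens
    lm_loss = -math.log(student[target_id])   # standard next-token cross-entropy
    regret = kl_divergence(teacher, student)  # pulls the student toward the teacher
    return lm_loss + regret_weight * regret, lm_loss, regret

# Toy three-token vocabulary; the teacher is more confident about token 1.
student_logits = [2.0, 1.0, 0.1]
teacher_logits = [0.5, 3.0, 0.1]
total, lm_loss, regret = regret_objective(student_logits, teacher_logits, target_id=1)
print(f"lm_loss={lm_loss:.4f} regret={regret:.4f} total={total:.4f}")
```

In a real training loop both sets of logits would come from the same model under two attention patterns, and only the student view would receive gradients. The regret term vanishes when the two views agree, so the objective then reduces to ordinary causal language modeling.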