Regret Pre-training: Bridging Prior and Posterior Views for Enhanced Knowledge Grounding

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Causal language models condition only on preceding context, which limits the knowledge their representations can capture. This work proposes the Regret Pre-training framework, which, for the first time, incorporates future context as privileged information during pre-training through a dual-view architecture: a student view performs standard causal modeling, while a teacher view uses future tokens to produce a future-conditioned distribution. A regret loss, defined as the KL divergence between the two views, injects future-aware signals into the causal representations. The approach adds no model parameters and combines self-supervised learning with an extended attention mechanism. After training on 4 billion tokens, GlobalRegret and LocalRegret reach average accuracies of 33.9% and 32.2%, respectively, across nine downstream tasks, compared with 30.2% for the baseline (gains of 3.7 and 2.0 percentage points); notably, GlobalRegret yields an 18.1-percentage-point improvement on BoolQ.

📝 Abstract

Causal language models factorize sequence probabilities using only preceding context, leaving future information unexploited during training despite its availability in the training data. This paper introduces Regret Pre-training, a self-supervised framework grounded in the Learning Using Privileged Information (LUPI) paradigm. The framework employs a dual-view architecture in which a single model generates both a causal Student distribution and a future-conditioned Teacher distribution. The training objective augments standard language modeling with a regret loss that minimizes the KL divergence from teacher to student, transferring future-aware signals to the causal representations. We investigate two teacher configurations on the OLMoE-1B-7B architecture: LocalRegret, which extends attention by one future token, and GlobalRegret, which conditions on bidirectional context with the target position masked. Experiments on nine downstream tasks following 4 billion tokens of training demonstrate that both configurations consistently outperform the baseline. On average, GlobalRegret and LocalRegret achieve 33.9% and 32.2% accuracy respectively, surpassing the baseline's 30.2%. Most notably, GlobalRegret improves BoolQ performance by 18.1 percentage points (61.0% vs 42.9%). The framework introduces no additional parameters and requires only one extra inference-mode forward pass per training step.
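
The objective described in the abstract can be illustrated for a single target position with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the `regret_weight` coefficient, the toy logits, and the reading of "from teacher to student" as KL(teacher || student) are all assumptions made for illustration. In real training the teacher logits would come from the extra inference-mode forward pass and carry no gradient.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_id):
    """Standard language-modeling loss for one position."""
    return -math.log(probs[target_id])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def regret_objective(student_logits, teacher_logits, target_id, regret_weight=1.0):
    """LM loss plus a weighted regret term (hypothetical weighting).

    student_logits: causal view, conditioned on preceding context only.
    teacher_logits: future-conditioned view, treated here as a constant.
    """
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    lm_loss = cross_entropy(student, target_id)
    # Assumed direction: KL(teacher || student).
    regret = kl_divergence(teacher, student)
    return lm_loss + regret_weight * regret, lm_loss, regret


# Toy three-token vocabulary; the teacher is more confident in token 1.
student_logits = [2.0, 1.0, 0.1]
teacher_logits = [0.5, 3.0, 0.1]
total, lm_loss, regret = regret_objective(student_logits, teacher_logits, target_id=1)
print(f"lm_loss={lm_loss:.4f} regret={regret:.4f} total={total:.4f}")
```

The regret term is zero when the two views agree and grows as the future-conditioned view diverges from the causal one, which is how it passes future-aware signal to the student without adding parameters.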

Problem

Research questions and friction points this paper is trying to address.

causal language models

future information

knowledge grounding

sequence modeling

privileged information

Innovation

Methods, ideas, or system contributions that make the work stand out.

Regret Pre-training

Learning Using Privileged Information

Causal Language Modeling