🤖 AI Summary
This study addresses the risk that negative discourse about AI alignment in pretraining corpora may be internalized by large language models, leading to self-fulfilling alignment failures. It presents the first systematic investigation of the causal impact of AI-related discourse in pretraining data on model alignment behavior. By up-sampling synthetically generated documents containing either aligned or misaligned AI narratives during the pretraining of 6.9B-parameter models, the authors show that exposure to misaligned discourse significantly increases undesirable behaviors, whereas reinforcing aligned discourse reduces misalignment scores from 45% to 9%. Notably, these effects persist in dampened form even after post-training. The work proposes a novel "alignment pretraining" paradigm, arguing that alignment objectives should be addressed at the pretraining stage rather than deferred entirely to later fine-tuning phases.
📝 Abstract
Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment: upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners pretrain for alignment as well as capabilities. Our models and datasets are available at alignmentpretraining.ai.
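The upsampling intervention the abstract describes, repeating a small set of AI-discourse documents within a larger pretraining mixture, can be sketched as a simple data-mixture step. The function name, the toy corpora, and the upsampling factor below are illustrative assumptions for exposition, not the paper's actual implementation or data:

```python
import random


def build_mixture(base_docs, ai_docs, upsample_factor, seed=0):
    """Return a shuffled pretraining mixture in which each AI-discourse
    document appears `upsample_factor` times (illustrative sketch only)."""
    mixture = list(base_docs) + list(ai_docs) * upsample_factor
    random.Random(seed).shuffle(mixture)
    return mixture


# Hypothetical toy corpora (not the paper's datasets).
base = [f"web_doc_{i}" for i in range(8)]
aligned = ["aligned_ai_story_0", "aligned_ai_story_1"]

mix = build_mixture(base, aligned, upsample_factor=3)
# Each synthetic aligned-AI document now appears 3 times in the mixture.
```

In practice the comparison in the paper varies which narrative (aligned vs. misaligned) is upsampled while holding the rest of the pretraining setup fixed, so any behavioural difference can be attributed to the discourse itself.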