🤖 AI Summary
This study identifies an "illusory truth effect" in large language models (LLMs) during continual pretraining: repeated exposure to false but confidently stated assertions systematically distorts a model's factual beliefs. Prior work focuses on poisoning of static pre-training data, overlooking vulnerabilities that arise dynamically during continual training.
Method: We propose the Layer of Truth framework and a dedicated dataset to systematically probe belief evolution across model scales, network layers, and training stages. Using controlled false-data injection and intermediate representation probing, we track representational shifts throughout training.
Contribution/Results: Experiments show that even small amounts of high-confidence false information induce persistent representational drift that carries across checkpoints, with susceptibility following clear layer-wise and scale-dependent patterns. Our work provides the first empirical characterization of LLM cognitive fragility under continual learning, delivering both foundational insights and methodological tools for developing robust adaptive training mechanisms.
📝 Abstract
Large language models (LLMs) continually evolve through pre-training on ever-expanding web data, but this adaptive process also exposes them to subtle forms of misinformation. While prior work has explored data poisoning during static pre-training, the effects of such manipulations under continual pre-training remain largely unexplored. Drawing inspiration from the illusory truth effect in human cognition, where repeated exposure to falsehoods increases belief in their accuracy, we ask whether LLMs exhibit a similar vulnerability. We investigate whether repeated exposure to false but confidently stated facts can shift a model's internal representation away from the truth.
We introduce Layer of Truth, a framework and dataset for probing belief dynamics in continually trained LLMs. By injecting controlled amounts of poisoned data and probing intermediate representations across checkpoints, model scales, and question types, we quantify when and how factual beliefs shift. Our findings reveal that even minimal exposure can induce persistent representational drift in well-established facts, with susceptibility varying across layers and model sizes. These results highlight an overlooked vulnerability of continually updated LLMs: they can internalize misinformation much as humans do, underscoring the need for robust monitoring of factual integrity during model updates.
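The probing step described in the abstract is typically a small classifier trained on hidden states to read out whether the model "believes" a statement. A minimal sketch, assuming synthetic clustered vectors in place of real activations (every name here is hypothetical, and the probe is plain logistic regression trained by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n = 16, 200

# Synthetic "hidden states": representations of true statements cluster
# around +mu, false statements around -mu (stand-ins for real activations).
mu = rng.normal(size=dim)
X = np.vstack([rng.normal(size=(n, dim)) + mu,
               rng.normal(size=(n, dim)) - mu])
y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = true, 0 = false

# Logistic-regression probe trained with batch gradient descent
# on the binary cross-entropy loss.
w, b = np.zeros(dim), 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(statement is true)
    g = p - y                               # gradient of cross-entropy wrt logits
    w -= 0.1 * (X.T @ g) / len(y)
    b -= 0.1 * g.mean()

# Training accuracy of the fitted probe.
p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = float(((p > 0.5) == y).mean())
```

Applied at each layer and checkpoint, the probe's accuracy (or its decision on a fixed set of facts) gives the belief-evolution signal the framework tracks; a drop after exposure to poisoned data is the drift the paper reports.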