Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

📅 2025-10-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the architectural origins of hallucinations in large language models (LLMs). Method: We propose the Distributional Semantics Tracing (DST) framework, integrating causal intervention with dual-process theory to identify, for the first time, an inherent "commitment layer" that renders hallucinations unavoidable, and to uncover a semantic-conflict mechanism between a fast associative processing pathway and a slow contextual one. We further define a semantic coherence metric that quantifies semantic consistency along the contextual pathway. Contribution/Results: Experiments show this metric exhibits a strong negative correlation with hallucination rate (ρ = −0.863), enabling precise localization of hallucination onset, both spatially (layer-wise) and temporally (token-wise), and revealing its underlying semantic cause. Our study provides the first interpretable and predictive mechanistic account of hallucinations in Transformers, grounded in distributional semantics and causal analysis.
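The summary does not spell out how the semantic coherence metric is computed, but the general recipe (score each prompt's internal consistency along the contextual pathway, then correlate those scores with measured hallucination rates) can be sketched. In the minimal Python snippet below, the layer-to-layer cosine-similarity score, the `gpt2` checkpoint, and the toy prompts and hallucination rates are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumption: a layer-wise cosine-similarity score stands in for
# the paper's semantic coherence metric; Hugging Face `transformers` provides states).
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, not the one used in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True).eval()

def semantic_coherence(prompt: str) -> float:
    """Average cosine similarity between consecutive layers' last-token states.

    Higher values mean a smoother, more consistent semantic trajectory across
    layers (a stand-in for the paper's contextual-pathway coherence).
    """
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states          # (n_layers + 1) x [1, seq, d]
    last_tok = torch.stack([h[0, -1] for h in hidden])  # [n_layers + 1, d]
    sims = torch.nn.functional.cosine_similarity(last_tok[:-1], last_tok[1:], dim=-1)
    return sims.mean().item()

# Correlate coherence with per-prompt hallucination rates (hypothetical data;
# in practice the rate would come from judging many sampled continuations).
prompts = [
    "The capital of Australia is",
    "The author of Hamlet is",
    "The chemical symbol for gold is",
    "The tallest mountain on Earth is",
]
hallucination_rate = [0.7, 0.1, 0.4, 0.2]
coherence = [semantic_coherence(p) for p in prompts]
rho, p_value = spearmanr(coherence, hallucination_rate)
print(f"Spearman rho = {rho:.3f} (the paper reports rho = -0.863 on its benchmark)")
```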

📝 Abstract
Large Language Models (LLMs) are prone to hallucination, the generation of plausible yet factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions. First, to enable reliable tracing of internal semantic failures, we propose Distributional Semantics Tracing (DST), a unified framework that integrates established interpretability techniques to produce a causal map of a model's reasoning, treating meaning as a function of context (distributional semantics). Second, we pinpoint the layer at which a hallucination becomes inevitable, identifying a specific commitment layer where a model's internal representations irreversibly diverge from factuality. Third, we identify the underlying mechanism of these failures. We observe a conflict between distinct computational pathways, which we interpret through the lens of dual-process theory: a fast, heuristic associative pathway (akin to System 1) and a slow, deliberate contextual pathway (akin to System 2), leading to predictable failure modes such as Reasoning Shortcut Hijacks. Our framework's ability to quantify the coherence of the contextual pathway reveals a strong negative correlation (ρ = −0.863) with hallucination rates, implying that these failures are predictable consequences of internal semantic weakness. The result is a mechanistic account of how, when, and why hallucinations occur within the Transformer architecture.
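The abstract does not specify how the commitment layer is located; one plausible operationalization (an assumption, not the paper's DST procedure) is a logit-lens-style scan: project each layer's hidden state through the model's unembedding and report the earliest layer from which the incorrect token permanently outranks the factual one. The `gpt2` model, the " Canberra"/" Sydney" token pair, and the `commitment_layer` helper below are all hypothetical.

```python
# Illustrative sketch of locating a "commitment layer": the earliest layer from
# which the wrong token outranks the correct token at every subsequent layer.
# (The logit-lens projection is an assumption; the paper's DST framework may differ.)
from typing import Optional

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()

def commitment_layer(prompt: str, correct: str, wrong: str) -> Optional[int]:
    """Return the first layer index from which `wrong` beats `correct` onward."""
    correct_id = tok(correct, add_special_tokens=False)["input_ids"][0]
    wrong_id = tok(wrong, add_special_tokens=False)["input_ids"][0]
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states       # (n_layers + 1) hidden states
    unembed = model.get_output_embeddings().weight   # [vocab, d], tied with wte in GPT-2
    wrong_wins = []
    for h in hidden[1:]:                             # skip the embedding-layer output
        logits = model.transformer.ln_f(h[0, -1]) @ unembed.T
        wrong_wins.append(bool(logits[wrong_id] > logits[correct_id]))
    for layer in range(len(wrong_wins)):
        if all(wrong_wins[layer:]):                  # divergence never reverses afterwards
            return layer + 1
    return None

# Hypothetical usage: the factual/hallucinated token pair is illustrative only.
layer = commitment_layer("The capital of Australia is", " Canberra", " Sydney")
print(f"Commitment layer (if any): {layer}")
```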
Problem

Research questions and friction points this paper is trying to address.

What are the architectural origins of hallucination in LLMs?
At which layer does a hallucination become inevitable?
How does the conflict between associative and contextual reasoning pathways produce hallucinations?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributional Semantics Tracing (DST) framework traces semantic failures via integrated interpretability techniques
Pinpoints the commitment layer at which internal representations irreversibly diverge from factuality
Quantifies contextual-pathway coherence, which strongly anti-correlates with hallucination rate (ρ = −0.863)
Gagan Bhatia
University of Aberdeen
Natural Language Processing, Machine Learning, Deep Learning, LLM Alignment, Financial NLP
Somayajulu G Sripada
University of Aberdeen
Kevin Allan
University of Aberdeen
Jacobo Azcona
University of Aberdeen