🤖 AI Summary
Hallucination in large language models (LLMs) severely undermines their reliability, yet its underlying neural mechanisms remain poorly understood. This work systematically uncovers the neuron-level basis of hallucination generation. Through fine-grained activation analysis, causal intervention experiments, and cross-task generalization tests, we identify a sparse set of neurons (less than 0.1% of the total) that stably predict hallucinatory outputs across diverse tasks and model architectures. Crucially, these neurons emerge during pre-training and are causally linked to the model's over-compliance behavior. Our study establishes, for the first time, an interpretable link between macroscopic hallucination phenomena and microscopic neural activity, yielding a mechanistic paradigm for targeted intervention and actionable pathways toward more reliable LLMs through neuron-level steering.
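
As a concrete illustration of the identification step, the sketch below probes per-neuron MLP activations with an L1-penalized linear classifier, so that only a sparse subset of neurons receives non-zero weight. This is a minimal sketch of the general technique, not the paper's actual pipeline: the model name, the hook location, and the toy labeled dataset are all placeholder assumptions.

```python
# Minimal sketch (not the paper's pipeline): fit an L1-penalized linear probe
# on per-neuron MLP activations so only a sparse subset gets non-zero weight.
# Model name, hook location, and toy dataset are placeholder assumptions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any decoder-only LM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

acts = {}  # layer index -> MLP activation vector of the final prompt token

def record(layer_idx):
    def fn(module, inp, out):
        acts[layer_idx] = out[0, -1].detach().float().cpu().numpy()
    return fn

for i, block in enumerate(model.model.layers):  # path is architecture-dependent
    block.mlp.act_fn.register_forward_hook(record(i))

def neuron_features(prompt):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        model(ids)
    return np.concatenate([acts[i] for i in sorted(acts)])

# Toy labels (1 = the model's answer was hallucinated, 0 = faithful);
# a real run would use many examples scored against ground truth.
data = [("The capital of France is", 0),
        ("The capital of the moon is", 1),
        ("Water boils at sea level at", 0),
        ("The 51st US state is named", 1)]
X = np.stack([neuron_features(p) for p, _ in data])
y = np.array([label for _, label in data])

probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
probe.fit(X, y)
h_neurons = np.flatnonzero(probe.coef_[0])  # candidate H-Neuron indices
print(f"{h_neurons.size} of {X.shape[1]} neurons selected "
      f"({h_neurons.size / X.shape[1]:.4%})")
```

In practice, the selected set would then be validated on held-out tasks and other model families to check the cross-task, cross-architecture generalization the paper reports.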
📝 Abstract
Large language models (LLMs) frequently generate hallucinations -- plausible but factually incorrect outputs -- undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding their identification, we demonstrate that a remarkably sparse subset of neurons (less than $0.1\%$ of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning their origins, we trace these neurons back to the pre-trained base models and find that they remain predictive for hallucination detection, indicating that they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.
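
The controlled-intervention step can be made concrete with a short continuation of the probing sketch above: ablate the candidate neurons during generation and check whether the model's over-compliance changes on a request it should decline. The zero-scale ablation, the steering hook, and the test prompt are illustrative assumptions, not necessarily the paper's exact procedure.

```python
# Continuing the probing sketch: causally test candidate H-Neurons by scaling
# their activations during generation (an assumed steering procedure, not
# necessarily the paper's exact method) and observing the behavior change.
import torch

width = acts[0].shape[0]            # per-layer MLP width from the sketch above
coef = torch.tensor(probe.coef_[0])
layer_to_idx = {i: torch.nonzero(coef[i * width:(i + 1) * width]).flatten()
                for i in range(len(model.model.layers))}

def steer(neuron_idx, scale):
    # Multiply the chosen neurons' activations by `scale`
    # (0.0 ablates them; values > 1.0 amplify them).
    def fn(module, inp, out):
        out = out.clone()
        out[..., neuron_idx] *= scale
        return out
    return fn

handles = [model.model.layers[i].mlp.act_fn.register_forward_hook(
               steer(idx, scale=0.0))  # ablate candidate H-Neurons
           for i, idx in layer_to_idx.items() if idx.numel() > 0]

# An unanswerable request: a model that is not over-compliant should decline.
prompt = "Name the prime minister of Atlantis."
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    steered = model.generate(ids, max_new_tokens=40, do_sample=False)
print(tok.decode(steered[0], skip_special_tokens=True))

for h in handles:
    h.remove()  # restore the unmodified model
```

Comparing the steered output against an unsteered run of the same prompt gives the kind of paired evidence that links these neurons to over-compliance.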