🤖 AI Summary
This work addresses the instability of internal representations in large language models that underlies hallucination, a failure mode that existing defenses mitigate poorly because they lack fine-grained intervention capabilities. The study identifies, for the first time, high-variance dimensions in Transformer hidden states, termed H-Nodes, that correlate strongly with hallucinatory behavior at the single-dimension level, and leverages them to construct white-box attacks that induce controllable hallucinations. Building on this insight, the authors propose an adaptive noise cancellation mechanism that combines logistic regression probes, real-time forward hooks for intervention, confidence-weighted scoring, and dynamic re-ranking to enable iterative adversarial defense. Experiments on models ranging from OPT-125M to LLaMA-3-8B show a 33-42% reduction in activation drift and an increase in robustness from 8% to 69%, with minimal impact on general capability: perplexity rises by less than 5% and MMLU performance drops by no more than 3%.
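The white-box attack sketched in the summary can be pictured as a forward hook that multiplicatively amplifies a handful of hidden-state dimensions at inference time. The following is a minimal, hypothetical numpy stand-in, not the paper's code: the H-Node indices, gain, and hidden-state shape are all illustrative assumptions.

```python
# Hypothetical sketch of an H-Node amplification attack: a forward-hook-style
# function that scales a few high-variance hidden dimensions at inference time.
# H_NODES and GAIN are illustrative, not taken from the paper's code.
import numpy as np

H_NODES = np.array([3, 7, 11, 19])   # hypothetical H-Node indices
GAIN = 3.0                           # amplification factor (illustrative)

def attack_hook(hidden_state: np.ndarray) -> np.ndarray:
    """Amplify H-Node dimensions of a (seq_len, d_model) hidden state."""
    out = hidden_state.copy()        # leave the original activations intact
    out[:, H_NODES] *= GAIN          # selective push on the targeted dims only
    return out

h = np.ones((5, 32))                 # toy hidden state
h_adv = attack_hook(h)
```

Because only the targeted dimensions change, the perturbation stays selective: all other coordinates pass through untouched, which is what keeps such an attack hard for a defender to spot.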
📝 Abstract
We present H-Node Adversarial Noise Cancellation (H-Node ANC), a mechanistic framework that identifies, exploits, and defends against hallucination representations in transformer-based large language models (LLMs) at the level of individual hidden-state dimensions. A logistic regression probe trained on last-token hidden states localizes the hallucination signal to a small set of high-variance dimensions -- termed Hallucination Nodes (H-Nodes) -- with probe AUC reaching 0.90 across four architectures. A white-box adversarial attack amplifies these dimensions at inference time via a real-time forward hook, achieving 3.02x selectivity while remaining less than 10% visible to the defender. The adaptive ANC defense suppresses excess H-Node activation during the forward pass using confidence-weighted cancellation, reducing grounded activation drift by 33-42% relative to static cancellation. A dynamic iterative extension that re-ranks cancellation targets across successive passes raises robustness from an 8% single-pass baseline to 69%. All contributions are validated on OPT-125M, Phi-3-mini-4k-instruct, LLaMA-3-8B-Instruct, and Mistral-7B-Instruct-v0.3 (125M-8B parameters). The perplexity impact is minimal (a <5% increase) and MMLU degradation is at most 3%, confirming that the defense does not impair general reasoning capability.
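The defense pipeline in the abstract (locate high-variance H-Nodes, probe them with logistic regression, then cancel their excess with a confidence weight) can be sketched end to end on synthetic data. This is a toy numpy illustration under stated assumptions, not the authors' implementation: the hidden size, H-Node indices, and the way hallucinated samples are simulated are all invented for the example.

```python
# Hypothetical numpy sketch (not the authors' code) of the H-Node ANC defense:
# 1) H-Nodes = highest-variance hidden dimensions, 2) logistic probe on
# last-token states, 3) confidence-weighted cancellation toward the grounded
# mean, applied where a real forward hook would run.
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 200, 32, 4                       # samples, hidden size, #H-Nodes (toy)
labels = rng.integers(0, 2, N)             # 1 = hallucinated, 0 = grounded
H = rng.normal(0.0, 1.0, (N, D))           # synthetic last-token hidden states
true_nodes = np.array([3, 7, 11, 19])      # hypothetical H-Node indices
H[np.ix_(labels == 1, true_nodes)] += 2.5  # hallucinations inflate these dims

# Step 1: pick the K dimensions with the highest activation variance.
top = np.argsort(H.var(axis=0))[-K:]

# Step 2: logistic regression probe on the H-Node features (plain SGD).
X, w, b = H[:, top], np.zeros(K), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * X.T @ (p - labels) / N
    b -= 0.1 * (p - labels).mean()

mu = H[labels == 0].mean(axis=0)           # grounded reference activations

# Step 3: confidence-weighted cancellation, as a forward hook would apply it.
def anc_hook(hidden):
    conf = 1.0 / (1.0 + np.exp(-(hidden[top] @ w + b)))  # probe confidence
    out = hidden.copy()
    out[top] -= conf * (hidden[top] - mu[top])  # shrink toward grounded mean
    return out

halluc = H[labels == 1]
drift_before = np.abs(halluc[:, top] - mu[top]).mean()
drift_after = np.abs(
    np.array([anc_hook(h) for h in halluc])[:, top] - mu[top]).mean()
```

The confidence weight is what makes the cancellation adaptive: grounded states (low probe confidence) are barely touched, while strongly flagged states are pulled most of the way back to the grounded mean, mirroring the drift-reduction behavior the abstract reports.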