TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes a white-box hallucination detection method for large language models, which often generate fluent yet factually incorrect responses with unwarranted confidence. Using the logit lens, the approach reads out the multi-head self-attention output, the feed-forward output, and the residual stream at each layer, computes the entropy of each resulting logit distribution, and assembles a 3L-dimensional trajectory (three entropy values for each of the L layers) that characterizes the model’s internal certainty dynamics. The method fuses the entropy signals from these three architectural components as complementary indicators, so detection requires neither storing high-dimensional hidden states nor sampling multiple generations. Experiments on instruction-tuned models and question-answering benchmarks show strong detection performance and indicate that how internal computation converges is informative for identifying hallucinated outputs.

📝 Abstract

When a language model hallucinates, the final answer is wrong, but the mistake is not necessarily invisible inside the model. Different internal pathways may remain uncertain, disagree in how quickly they sharpen, or commit to competing continuations before the output is produced. We introduce TriLens, a white-box detector that turns this intuition into a compact representation: at every layer, it reads the multi-head self-attention output, the feed-forward output, and the residual stream through the model's own logit lens, then records only the entropy of each readout. The resulting 3L-dimensional trajectory describes how certainty forms across depth and across modules, without storing high-dimensional hidden states or sampling multiple generations. This simple signal yields a strong detector across instruction-tuned LLMs and QA benchmarks, and our analyses show that the three module-wise entropy trajectories provide complementary evidence. TriLens suggests that hallucination detection can benefit from tracking how internal computation settles, not only what the final layer predicts.
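
The feature construction described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the parameter-free layer norm applied before the unembedding, the use of a single token position, and the ordering of the concatenated vector (attention, then feed-forward, then residual) are all assumptions made for the example. In a real model the per-layer activations would come from forward hooks, and the readout would use the model's own final normalization and unembedding weights.

```python
import numpy as np

def softmax_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) along the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)           # stabilize exp
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))  # log-softmax
    return -(np.exp(log_p) * log_p).sum(axis=-1)

def layer_norm(h, eps=1e-5):
    """Parameter-free layer norm (assumption: stands in for the model's final norm)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def trilens_features(attn_out, mlp_out, resid, W_U):
    """Build a 3L-dimensional entropy trajectory for one token position.

    attn_out, mlp_out, resid: arrays of shape (L, d), one row per layer.
    W_U: unembedding matrix of shape (d, V).
    Returns shape (3L,), ordered [attention | feed-forward | residual] (assumed order).
    """
    feats = []
    for h in (attn_out, mlp_out, resid):
        logits = layer_norm(h) @ W_U          # logit-lens readout, shape (L, V)
        feats.append(softmax_entropy(logits))  # one entropy per layer, shape (L,)
    return np.concatenate(feats)

# Toy usage with random activations (hypothetical sizes).
rng = np.random.default_rng(0)
L, d, V = 4, 16, 50
W_U = rng.normal(size=(d, V))
feats = trilens_features(
    rng.normal(size=(L, d)),
    rng.normal(size=(L, d)),
    rng.normal(size=(L, d)),
    W_U,
)
print(feats.shape)
```

Each entry lies between 0 (the readout puts all mass on one token) and log V (the readout is uniform), so the vector stays small regardless of hidden size or vocabulary size. A downstream classifier over these features is not shown here, since the abstract does not specify one.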

Problem

Research questions and friction points this paper is trying to address.

hallucination detection

language models

white-box analysis

internal uncertainty

logit lens

Innovation

Methods, ideas, or system contributions that make the work stand out.

TriLens

logit-lens entropy

white-box hallucination detection