🤖 AI Summary
This work addresses hallucination in large language models during multi-step reasoning. Existing detectors mostly score a whole trace, cannot pinpoint the first erroneous step, and often need several sampled completions. The authors instead model the reasoning process as a trajectory of hidden states from a single forward pass and reframe hallucination detection as a trajectory dynamics problem. They introduce a label-conditioned teacher model based on contrastive principal component analysis (contrastive PCA) which, combined with geometric transition features grounded in transport cost, identifies anomalous transitions that deviate from a stable manifold. A lightweight BiLSTM student is then distilled from the teacher and runs on raw hidden states without inference-time labels. Theoretically, they prove that contrastive PCA is the optimal projection for a transport-separation objective between first-error and correct states. Experiments on ProcessBench, PRM800K, HaluEval, and TruthfulQA show both models outperforming entropy-, probe-, and attention-based baselines in-domain; the teacher transfers stably across language models and datasets, whereas the student collapses under distribution shift.
📝 Abstract
Large language models hallucinate during multi-step reasoning, but most existing detectors operate at the trace level: they assign one confidence score to a full output, fail to localize the first error, and often require multiple sampled completions. We frame hallucination instead as a property of the hidden-state trajectory produced during a single forward pass. Correct reasoning moves through a stable manifold of locally coherent transitions; a first error appears as a localized excursion in transport cost away from this manifold. We operationalize this view with a label-conditioned teacher that builds a trace-specific contrastive PCA lens and scores each step with seven geometric transition features, and a deployable BiLSTM student distilled from the teacher that operates on raw hidden states without inference-time labels. We prove that contrastive PCA is the optimal projection for a transport-separation objective between first-error and correct states, and that single-pass first-error localization holds whenever the first error creates a positive transport margin over preceding correct transitions. On ProcessBench, PRM800K, HaluEval, and TruthfulQA, both models outperform entropy-based, probing-based, and attention-based baselines in-domain; the teacher transfers stably across language models and datasets, while the student collapses under shift, a gap our distillation theory predicts. These results recast step-level hallucination detection as a problem of trajectory dynamics and identify the central obstacle to deployment: preserving the contrastive transport margin under distribution shift.
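To make the idea of a contrastive PCA lens and a transport margin concrete, here is a minimal NumPy sketch on toy 2-D data. It is not the authors' implementation. The function names, the contrast weight `alpha`, the squared-Euclidean step cost, and the threshold `tau` are all illustrative assumptions, and the paper's seven geometric features are not reproduced.

```python
import numpy as np

def contrastive_pca(target, background, alpha=1.0, k=1):
    """Top-k eigenvectors of cov(target) - alpha * cov(background).

    Assumption: 'target' holds first-error states and 'background' holds
    correct states; the paper's exact construction may differ.
    """
    c_target = np.cov(target, rowvar=False)
    c_background = np.cov(background, rowvar=False)
    evals, evecs = np.linalg.eigh(c_target - alpha * c_background)
    order = np.argsort(evals)[::-1][:k]  # largest contrast first
    return evecs[:, order]

def transition_costs(hidden, lens):
    """Squared Euclidean step cost between consecutive projected states
    (a stand-in for the transport cost named in the abstract)."""
    z = hidden @ lens
    return np.sum(np.diff(z, axis=0) ** 2, axis=1)

def first_excursion(costs, tau):
    """Index of the first transition whose cost exceeds tau, else None."""
    hits = np.flatnonzero(costs > tau)
    return int(hits[0]) if hits.size else None

# Toy data: axis 0 carries large variance shared by both groups,
# axis 1 carries the variance that only the error states have.
background = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 0.1], [0.0, -0.1]])
target = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 2.0], [0.0, -2.0]])
lens = contrastive_pca(target, background)  # selects axis 1 (up to sign)

# Toy trajectory: large but uninformative swings on axis 0, and one jump
# on axis 1 at the transition from state 3 to state 4.
trajectory = np.array([[0.0, 0.0], [5.0, 0.1], [-5.0, 0.2],
                       [5.0, 0.3], [0.0, 2.5], [5.0, 2.6]])
costs = transition_costs(trajectory, lens)
print(first_excursion(costs, tau=1.0))  # -> 3
```

In the raw 2-D space the largest step costs come from the swings on axis 0, so the jump at transition 3 is hidden. After projection it is the only cost above the preceding ones, which is the positive-margin situation the localization result refers to. Any threshold between the largest preceding cost and the excursion cost recovers the same index.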