Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping

📅 2026-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of detecting hallucinations in long reasoning trajectories generated by large reasoning models, which often appear coherent yet contain latent errors that are difficult to identify. To this end, the authors propose Answer-agreement Representation Shaping (ARS), a self-supervised method that applies small latent perturbations to trace-boundary embeddings to generate counterfactual answers. By using answer agreement as a signal for contrastive learning, ARS learns more discriminative trajectory representations that expose instability in the reasoning process without requiring human annotations. Experiments demonstrate that ARS consistently outperforms strong baselines across multiple benchmarks and can be integrated as a plug-and-play component into existing embedding-based hallucination detectors.

📝 Abstract
Large reasoning models (LRMs) often generate long, seemingly coherent reasoning traces yet still produce incorrect answers, making hallucination detection challenging. Although trajectories contain useful signals, directly using trace text or vanilla hidden states for detection is brittle: traces vary in form and detectors can overfit to superficial patterns rather than answer validity. We introduce Answer-agreement Representation Shaping (ARS), which learns detection-friendly trace-conditioned representations by explicitly encoding answer stability. ARS generates counterfactual answers through small latent interventions, specifically, perturbing the trace-boundary embedding, and labels each perturbation by whether the resulting answer agrees with the original. It then learns representations that bring answer-agreeing states together and separate answer-disagreeing ones, exposing latent instability indicative of hallucination risk. The shaped embeddings are plug-and-play with existing embedding-based detectors and require no human annotations during training. Experiments demonstrate that ARS consistently improves detection and achieves substantial gains over strong baselines.
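The abstract's pipeline (perturb the trace-boundary embedding, label each perturbation by answer agreement, then shape representations contrastively) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Gaussian perturbation, the `answer_fn` readout, and the margin-based contrastive loss are all assumptions standing in for the model's actual decoding and training objective.

```python
import numpy as np

def perturb_boundary(h, sigma=0.1, n=8, rng=None):
    """Hypothetical latent intervention: add small Gaussian noise
    to the trace-boundary embedding h, yielding n perturbed copies."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return h + sigma * rng.standard_normal((n, h.shape[-1]))

def agreement_labels(answer_fn, h, perturbed):
    """Label each perturbed state 1 if it decodes to the same answer
    as the original embedding, else 0 (no human annotation needed)."""
    base = answer_fn(h)
    return np.array([int(answer_fn(p) == base) for p in perturbed])

def contrastive_shaping_loss(z, z_perturbed, labels, margin=1.0):
    """Assumed shaping objective: pull answer-agreeing states toward
    the original representation, push disagreeing ones past a margin."""
    d = np.linalg.norm(z_perturbed - z, axis=1)
    pos = labels * d ** 2                                   # agree: small distance
    neg = (1 - labels) * np.maximum(0.0, margin - d) ** 2   # disagree: large distance
    return float(np.mean(pos + neg))

# Toy usage: a linear argmax readout stands in for answer decoding.
rng = np.random.default_rng(1)
h = np.array([1.0, 0.0, 0.0, 0.0])
answer_fn = lambda e: int(np.argmax(e))
perturbed = perturb_boundary(h, sigma=0.05, n=8, rng=rng)
labels = agreement_labels(answer_fn, h, perturbed)
loss = contrastive_shaping_loss(h, perturbed, labels)
```

In the paper's setting, high answer instability under such small interventions is the signal of hallucination risk; here that would surface as many zero labels and a larger loss for the unshaped representation.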
Problem

Research questions and friction points this paper is trying to address.

hallucination detection
reasoning trajectories
large reasoning models
answer validity
representation shaping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Answer-agreement Representation Shaping
hallucination detection
reasoning trajectories
counterfactual perturbation
embedding-based detection