Monitoring Decoding: Mitigating Hallucination via Evaluating the Factuality of Partial Response during Generation

📅 2025-03-05
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) frequently generate factually inconsistent outputs, so-called "factual hallucinations." Existing multi-sample self-consistency approaches suffer from high latency and fail to correct high-confidence errors. To address this, we propose an online factual monitoring and tree-based intervention mechanism operating during autoregressive decoding: at each token step, it dynamically assesses the factual plausibility of partial generations, identifies high-risk factual keywords, and triggers localized re-decoding to rectify inconsistencies. Crucially, our method abandons the costly full-sequence resampling paradigm, enabling real-time, fine-grained, low-overhead factual calibration *during* generation. Evaluated across multiple factual consistency benchmarks, our approach achieves significantly higher factual accuracy than self-consistency baselines while reducing inference latency by over 40% and substantially lowering computational overhead.

๐Ÿ“ Abstract
While large language models have demonstrated exceptional performance across a wide range of tasks, they remain susceptible to hallucinations -- generating plausible yet factually incorrect content. Existing methods for mitigating this risk often rely on sampling multiple full-length generations, which introduces significant response latency and becomes ineffective when the model consistently produces hallucinated outputs with high confidence. To address these limitations, we introduce Monitoring Decoding (MD), a novel framework that dynamically monitors the generation process and selectively applies in-process interventions, focusing on revising crucial tokens responsible for hallucinations. Instead of waiting for multiple full-length generations to complete, we identify hallucination-prone tokens during generation using a monitor function and further refine these tokens through a tree-based decoding strategy. This approach ensures enhanced factual accuracy and coherence in the generated output while maintaining efficiency. Experimental results demonstrate that MD consistently outperforms self-consistency-based approaches in both effectiveness and efficiency, achieving higher factual accuracy while significantly reducing computational overhead.
Problem

Research questions and friction points this paper is trying to address.

Mitigates hallucinations in large language models
Dynamically monitors and revises hallucination-prone tokens
Improves factual accuracy and reduces computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic monitoring of generation process
Selective in-process intervention on tokens
Tree-based decoding for factual accuracy
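The monitor-and-intervene control flow described above can be sketched as follows. The paper's actual monitor function and tree-based decoding operate on a real LLM's logits; the sketch below replaces both with hypothetical stand-ins (`toy_next_token_candidates`, `toy_monitor`) purely to illustrate the loop structure: decode greedily, flag low-scoring tokens, and locally re-decode by rescoring candidate branches instead of resampling the full sequence.

```python
import random

def toy_next_token_candidates(prefix, k=3):
    # Stand-in for sampling k candidate next tokens from a language model.
    vocab = ["Paris", "London", "Rome", "is", "the", "capital"]
    random.seed(len(prefix))  # deterministic for this demo
    return random.sample(vocab, k)

def toy_monitor(prefix, token):
    # Stand-in monitor: returns a plausibility score for appending `token`
    # to the partial generation. A real monitor would assess the factual
    # plausibility of the partial response, not use a fixed allowlist.
    return 1.0 if token in ("is", "the", "capital", "Paris") else 0.2

def monitoring_decode(prompt, max_tokens=5, threshold=0.5, branches=3):
    tokens = []
    for _ in range(max_tokens):
        prefix = prompt + " " + " ".join(tokens)
        candidates = toy_next_token_candidates(prefix, k=branches)
        token = candidates[0]  # default greedy choice
        if toy_monitor(prefix, token) < threshold:
            # In-process intervention: rather than resampling the whole
            # sequence, rescore the candidate branch tokens and keep the
            # one the monitor rates highest (a one-level "tree" expansion).
            token = max(candidates, key=lambda t: toy_monitor(prefix, t))
        tokens.append(token)
    return " ".join(tokens)

print(monitoring_decode("The capital of France"))
```

The key design point the sketch illustrates is locality: intervention happens only at flagged token positions, so the cost scales with the number of hallucination-prone tokens rather than with the number of full-length samples.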