Monitoring Decoding: Mitigating Hallucination via Evaluating the Factuality of Partial Response during Generation

📅 2025-03-05
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) frequently generate factually inconsistent outputs, so-called "factual hallucinations." Existing multi-sample self-consistency approaches suffer from high latency and fail to correct high-confidence errors. To address this, we propose an online factual monitoring and tree-based intervention mechanism operating during autoregressive decoding: at each token step, it dynamically assesses the factual plausibility of partial generations, identifies high-risk factual keywords, and triggers localized re-decoding to rectify inconsistencies. Crucially, our method abandons the costly full-sequence resampling paradigm, enabling real-time, fine-grained, low-overhead factual calibration *during* generation. Evaluated across multiple factual consistency benchmarks, our approach achieves significantly higher factual accuracy than self-consistency baselines while reducing inference latency by over 40% and substantially lowering computational overhead.

๐Ÿ“ Abstract
While large language models have demonstrated exceptional performance across a wide range of tasks, they remain susceptible to hallucinations -- generating plausible yet factually incorrect content. Existing methods for mitigating this risk often rely on sampling multiple full-length generations, which introduces significant response latency and becomes ineffective when the model consistently produces hallucinated outputs with high confidence. To address these limitations, we introduce Monitoring Decoding (MD), a novel framework that dynamically monitors the generation process and selectively applies in-process interventions, focusing on revising crucial tokens responsible for hallucinations. Instead of waiting for multiple full-length generations to complete, we identify hallucination-prone tokens during generation using a monitor function and further refine these tokens through a tree-based decoding strategy. This approach ensures enhanced factual accuracy and coherence in the generated output while maintaining efficiency. Experimental results demonstrate that MD consistently outperforms self-consistency-based approaches in both effectiveness and efficiency, achieving higher factual accuracy while significantly reducing computational overhead.
Problem

Research questions and friction points this paper is trying to address.

Mitigates hallucinations in large language models
Dynamically monitors and revises hallucination-prone tokens
Improves factual accuracy and reduces computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic monitoring of generation process
Selective in-process intervention on tokens
Tree-based decoding for factual accuracy
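The monitor-and-intervene control flow described above can be sketched as follows. The paper's actual monitor function and tree-based decoding operate on a real LLM's logits; the sketch below replaces both with hypothetical stand-ins (`toy_next_token_candidates`, `toy_monitor`) purely to illustrate the loop structure: decode greedily, flag low-scoring tokens, and locally re-decode by rescoring candidate branches instead of resampling the full sequence.

```python
import random

def toy_next_token_candidates(prefix, k=3):
    # Stand-in for sampling k candidate next tokens from a language model.
    vocab = ["Paris", "London", "Rome", "is", "the", "capital"]
    random.seed(len(prefix))  # deterministic for this demo
    return random.sample(vocab, k)

def toy_monitor(prefix, token):
    # Stand-in monitor: returns a plausibility score for appending `token`
    # to the partial generation. A real monitor would assess the factual
    # plausibility of the partial response, not use a fixed allowlist.
    return 1.0 if token in ("is", "the", "capital", "Paris") else 0.2

def monitoring_decode(prompt, max_tokens=5, threshold=0.5, branches=3):
    tokens = []
    for _ in range(max_tokens):
        prefix = prompt + " " + " ".join(tokens)
        candidates = toy_next_token_candidates(prefix, k=branches)
        token = candidates[0]  # default greedy choice
        if toy_monitor(prefix, token) < threshold:
            # In-process intervention: rather than resampling the whole
            # sequence, rescore the candidate branch tokens and keep the
            # one the monitor rates highest (a one-level "tree" expansion).
            token = max(candidates, key=lambda t: toy_monitor(prefix, t))
        tokens.append(token)
    return " ".join(tokens)

print(monitoring_decode("The capital of France"))
```

The key design point the sketch illustrates is locality: intervention happens only at flagged token positions, so the cost scales with the number of hallucination-prone tokens rather than with the number of full-length samples.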