🤖 AI Summary
This study investigates whether large language models (LLMs) possess an intrinsic self-hallucination detection capability, i.e., the ability to identify factually incorrect statements that they themselves generate.
Method: We propose a Chain-of-Thought (CoT)-driven framework for self-consistency analysis and knowledge provenance, formalizing hallucination detection as a sentence-level binary classification task. Crucially, the framework explicitly extracts implicit knowledge encoded in the model's parameters to support self-verification, a first in hallucination research.
Contribution/Results: Experiments show that GPT-3.5 Turbo detects 58.2% of its own hallucinations under this framework, substantially outperforming non-reasoning baselines. This demonstrates that LLMs exhibit a non-negligible intrinsic self-verification capacity. Our key contributions are: (1) the first CoT-based classification paradigm specifically designed for self-hallucination detection; (2) empirical validation that model-internal implicit knowledge is effective for factual self-assessment; and (3) a novel pathway toward enhancing LLM reliability through parameter-intrinsic verification mechanisms.
📝 Abstract
Large language models (LLMs) can generate fluent responses, but they sometimes hallucinate facts. In this paper, we investigate whether LLMs can detect their own hallucinations. We formulate hallucination detection as a sentence-level binary classification task. We propose a framework for estimating LLMs' capability of hallucination detection, together with a classification method that uses Chain-of-Thought (CoT) prompting to extract knowledge from their parameters. The experimental results indicate that GPT-3.5 Turbo with CoT detects 58.2% of its own hallucinations. We conclude that LLMs with CoT can detect hallucinations if sufficient knowledge is contained in their parameters.
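The detection setup described above can be sketched as a prompt-and-parse loop: a CoT prompt first asks the model to recall what it knows about the sentence's topic (surfacing implicit parametric knowledge), then to compare the sentence against that recalled knowledge and emit a binary verdict. The following is a minimal illustrative sketch; the function names, prompt wording, and verdict labels (`FACTUAL` / `HALLUCINATED`) are assumptions for illustration, not the authors' exact implementation.

```python
def build_cot_prompt(sentence: str) -> str:
    """Build a hypothetical CoT prompt for sentence-level
    self-hallucination detection: recall parametric knowledge
    first, then judge the sentence against it."""
    return (
        "Step 1: Write down what you know about the topic of the "
        "sentence below, using only knowledge recalled from your "
        "own parameters.\n"
        "Step 2: Compare the sentence against that knowledge and "
        "answer with exactly 'FACTUAL' or 'HALLUCINATED'.\n\n"
        f"Sentence: {sentence}"
    )

def parse_verdict(model_output: str) -> bool:
    """Return True if the model's final verdict is HALLUCINATED.

    The last occurrence of either label is taken as the verdict,
    so mentions earlier in the reasoning chain are ignored.
    """
    text = model_output.upper()
    h = text.rfind("HALLUCINATED")
    f = text.rfind("FACTUAL")
    if h == -1 and f == -1:
        raise ValueError("no verdict found in model output")
    return h > f
```

In practice, `build_cot_prompt` would be sent to the model under test (e.g., GPT-3.5 Turbo) via its chat API, and `parse_verdict` applied to the completion; the binary labels then feed directly into the sentence-level classification metrics.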