🤖 AI Summary
This work investigates whether large language models (LLMs) can detect hallucinations in mixed-context settings, focusing on their inability to reliably distinguish *factual hallucinations* (plausible yet subtly incorrect statements) from *non-factual hallucinations* (overtly absurd or logically inconsistent content). Through experiments with direct-generation and retrieval-based models of varying scales on standard summarization benchmarks, augmented by human annotation and controlled hallucination injection, the study uncovers a systematic bias toward accepting factual hallucinations, rooted in an imbalance between LLMs' intrinsic knowledge activation and their contextual grounding. All evaluated LLMs show, on average, 27.4% lower accuracy when detecting factual hallucinations than non-factual ones, identifying this gap as a critical bottleneck. Building on this finding, the authors propose a *knowledge–context co-utilization* paradigm, offering both theoretical insight into hallucination robustness and a principled technical framework for mitigation.
📝 Abstract
With the rapid development of large language models (LLMs), LLM-as-a-judge has emerged as a widely adopted approach for text quality evaluation, including hallucination evaluation. While previous studies have focused exclusively on single-context evaluation (e.g., discourse faithfulness or world factuality), real-world hallucinations typically involve mixed contexts, a setting that remains inadequately evaluated. In this study, we use summarization as a representative task to comprehensively evaluate LLMs' capability to detect mixed-context hallucinations, specifically distinguishing between factual and non-factual hallucinations. Through extensive experiments across direct-generation and retrieval-based models of varying scales, our main observations are: (1) LLMs' intrinsic knowledge introduces inherent biases in hallucination evaluation; (2) these biases particularly impact the detection of factual hallucinations, yielding a significant performance bottleneck; (3) the fundamental challenge lies in effective knowledge utilization, i.e., balancing LLMs' intrinsic knowledge against the external context for accurate mixed-context hallucination evaluation.
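The per-category evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the example claims, labels, and judge verdicts are hypothetical stand-ins for annotated summary sentences and an LLM judge's binary hallucination decisions.

```python
from collections import defaultdict

# Hypothetical annotated claims. Each is labeled "faithful",
# "factual_hallucination" (world-consistent but unsupported by the
# source document), or "non_factual_hallucination" (contradicts world
# knowledge). "judge_says_hallucination" stands in for an LLM judge's
# binary verdict; values here are illustrative only.
items = [
    {"claim": "The treaty was signed in 1998.",
     "label": "factual_hallucination",
     "judge_says_hallucination": False},  # judge fooled by plausibility
    {"claim": "The treaty was signed on the moon.",
     "label": "non_factual_hallucination",
     "judge_says_hallucination": True},
    {"claim": "Both parties ratified the treaty.",
     "label": "faithful",
     "judge_says_hallucination": False},
]

def per_category_accuracy(items):
    """Accuracy of a binary hallucination judge, broken down by gold label."""
    correct, total = defaultdict(int), defaultdict(int)
    for it in items:
        gold_is_hallucination = it["label"] != "faithful"
        total[it["label"]] += 1
        if it["judge_says_hallucination"] == gold_is_hallucination:
            correct[it["label"]] += 1
    return {lbl: correct[lbl] / total[lbl] for lbl in total}

acc = per_category_accuracy(items)
# The gap the paper highlights: non-factual minus factual detection accuracy.
gap = acc["non_factual_hallucination"] - acc["factual_hallucination"]
```

With real annotations in place of the toy items, `gap` corresponds to the factual-vs-non-factual accuracy difference reported in the summary.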