🤖 AI Summary
This work investigates whether large language models (LLMs) can detect hallucinations in mixed-context settings, focusing on their inability to reliably distinguish *factual hallucinations* (plausible yet subtly incorrect statements) from *non-factual hallucinations* (overtly absurd or logically inconsistent content). Through experiments with direct-generation and retrieval-based models of varying scales on standard summarization benchmarks, augmented by human annotation and controlled hallucination injection, the study uncovers a systematic bias toward accepting factual hallucinations, rooted in an imbalance between LLMs' intrinsic knowledge activation and their contextual grounding. All evaluated LLMs show, on average, 27.4% lower accuracy when detecting factual hallucinations than non-factual ones, identifying this gap as a critical bottleneck. Building on this finding, the authors propose a *knowledge–context co-utilization* paradigm, offering both theoretical insight into hallucination robustness and a principled technical framework for mitigation.
📝 Abstract
With the rapid development of large language models (LLMs), LLM-as-a-judge has emerged as a widely adopted approach for text quality evaluation, including hallucination evaluation. While previous studies have focused exclusively on single-context evaluation (e.g., discourse faithfulness or world factuality), real-world hallucinations typically involve mixed contexts, a setting that remains inadequately evaluated. In this study, we use summarization as a representative task to comprehensively evaluate LLMs' capability to detect mixed-context hallucinations, specifically distinguishing between factual and non-factual hallucinations. Through extensive experiments across direct-generation and retrieval-based models of varying scales, our main observations are: (1) LLMs' intrinsic knowledge introduces inherent biases in hallucination evaluation; (2) these biases particularly impact the detection of factual hallucinations, yielding a significant performance bottleneck; (3) the fundamental challenge lies in effective knowledge utilization, i.e., balancing LLMs' intrinsic knowledge against the external context for accurate mixed-context hallucination evaluation.
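The per-category evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the example claims, labels, and judge verdicts are hypothetical stand-ins for annotated summary sentences and an LLM judge's binary hallucination decisions.

```python
from collections import defaultdict

# Hypothetical annotated claims. Each is labeled "faithful",
# "factual_hallucination" (world-consistent but unsupported by the
# source document), or "non_factual_hallucination" (contradicts world
# knowledge). "judge_says_hallucination" stands in for an LLM judge's
# binary verdict; values here are illustrative only.
items = [
    {"claim": "The treaty was signed in 1998.",
     "label": "factual_hallucination",
     "judge_says_hallucination": False},  # judge fooled by plausibility
    {"claim": "The treaty was signed on the moon.",
     "label": "non_factual_hallucination",
     "judge_says_hallucination": True},
    {"claim": "Both parties ratified the treaty.",
     "label": "faithful",
     "judge_says_hallucination": False},
]

def per_category_accuracy(items):
    """Accuracy of a binary hallucination judge, broken down by gold label."""
    correct, total = defaultdict(int), defaultdict(int)
    for it in items:
        gold_is_hallucination = it["label"] != "faithful"
        total[it["label"]] += 1
        if it["judge_says_hallucination"] == gold_is_hallucination:
            correct[it["label"]] += 1
    return {lbl: correct[lbl] / total[lbl] for lbl in total}

acc = per_category_accuracy(items)
# The gap the paper highlights: non-factual minus factual detection accuracy.
gap = acc["non_factual_hallucination"] - acc["factual_hallucination"]
```

With real annotations in place of the toy items, `gap` corresponds to the factual-vs-non-factual accuracy difference reported in the summary.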