The Gray Zone of Faithfulness: Taming Ambiguity in Unfaithfulness Detection

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM summarization faithfulness benchmarks suffer from an ill-defined boundary for permissible external knowledge, leading to inconsistent human annotations. Method: We introduce an intermediate category, Out-Dependent, that separates faithful sentences, unfaithful sentences, and sentences whose verification legitimately requires external knowledge, and we construct VeriGray, the first benchmark to explicitly cover these gray-zone cases. Based on this ternary annotation schema, we conduct large-scale human annotation and multi-model evaluation. Contribution/Results: Even state-of-the-art LLMs, including GPT-5, hallucinate in roughly 6% of generated sentences, and about 8% of sentences on average across models fall into the Out-Dependent category. VeriGray substantially improves inter-annotator agreement and poses a significant challenge to current faithfulness detection methods, underscoring the importance of modeling the gray zone in evaluation frameworks.
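As a rough illustration of how these per-model sentence rates are tallied (a minimal sketch with hypothetical names; the paper does not publish this code):

```python
from collections import Counter

def label_rates(labels: list[str]) -> dict[str, float]:
    """Fraction of summary sentences per faithfulness label for one model."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy example: 50 annotated sentences from one model's summaries.
labels = ["faithful"] * 43 + ["unfaithful"] * 3 + ["out-dependent"] * 4
print(label_rates(labels))
# {'faithful': 0.86, 'unfaithful': 0.06, 'out-dependent': 0.08}
```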

📝 Abstract
Ensuring that Large Language Models (LLMs) generate summaries faithful to a given source document is essential for real-world applications. While prior research has explored LLM faithfulness, existing benchmarks suffer from annotation ambiguity, primarily due to the ill-defined boundary of permissible external knowledge in generated outputs. For instance, common sense is often incorporated into responses and labeled as "faithful", yet the acceptable extent of such knowledge remains unspecified, leading to inconsistent annotations. To address this issue, we propose a novel faithfulness annotation framework, which introduces an intermediate category, Out-Dependent, to classify cases where external knowledge is required for verification. Using this framework, we construct VeriGray (Verification with the Gray Zone) -- a new unfaithfulness detection benchmark in summarization. Statistics reveal that even SOTA LLMs, such as GPT-5, exhibit hallucinations ($\sim 6\%$ of sentences) in summarization tasks. Moreover, a substantial proportion ($\sim 8\%$ on average across models) of generated sentences fall into the Out-Dependent category, underscoring the importance of resolving annotation ambiguity in unfaithfulness detection benchmarks. Experiments demonstrate that our benchmark poses significant challenges to multiple baseline methods, indicating considerable room for future improvement.
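As a minimal sketch of the ternary schema described above (the `Label` enum and decision rule below are illustrative assumptions, not the authors' implementation):

```python
from enum import Enum

class Label(Enum):
    """Ternary sentence labels in the VeriGray annotation framework."""
    FAITHFUL = "faithful"            # entailed by the source document alone
    OUT_DEPENDENT = "out-dependent"  # verifiable only with external knowledge
    UNFAITHFUL = "unfaithful"        # unsupported or contradicted

def annotate(entailed_by_source: bool, needs_external_knowledge: bool) -> Label:
    """Toy decision rule standing in for human annotator judgments."""
    if entailed_by_source:
        return Label.FAITHFUL
    if needs_external_knowledge:
        return Label.OUT_DEPENDENT
    return Label.UNFAITHFUL

assert annotate(False, True) is Label.OUT_DEPENDENT  # the "gray zone"
```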
Problem

Research questions and friction points this paper is trying to address.

Addressing annotation ambiguity in LLM faithfulness detection
Defining permissible external knowledge boundaries in summaries
Establishing a reliable verification framework for unfaithfulness detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the Out-Dependent category for sentences verifiable only with external knowledge
Proposes novel faithfulness annotation framework addressing ambiguity
Constructs VeriGray benchmark for unfaithfulness detection in summarization
Qiang Ding
Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; State Key Lab of AI Safety, Beijing 100094, China; University of Chinese Academy of Sciences, Beijing 100049, China
Lvzhou Luo
Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; State Key Lab of AI Safety, Beijing 100094, China; University of Chinese Academy of Sciences, Beijing 100049, China
Yixuan Cao
Ping Luo