🤖 AI Summary
Current LLM factuality evaluation is hindered by ambiguous definitions of “atomic facts,” which lead to substantial disagreement between human and model annotators. To address this, we propose the first visualization-based analytical framework designed specifically for diagnosing ambiguity in fact decomposition. Through interactive visualizations, our method systematically identifies, and supports iterative refinement of, three core issues: semantic misalignment, granularity mismatch, and referential dependency. It integrates state-of-the-art NLP-based fact decomposition techniques with interpretable visual design to localize inconsistencies and guide targeted revisions. Experimental results show that our framework significantly improves inter-annotator agreement (Cohen’s κ increases by 0.32), enhances the stability and reproducibility of factuality assessment, and establishes a collaborative, auditable infrastructure for building high-fidelity evaluation benchmarks.
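To make the agreement metric concrete, here is a minimal sketch of how Cohen's κ could be scored over per-fact annotations. The `cohens_kappa` helper and the example labels are hypothetical illustrations under our own assumptions, not the framework's actual evaluation code.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two annotators over the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items on which the two annotators give the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: overlap expected from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-fact judgements ("atomic", "split", "merge") from a human and a model annotator.
human_labels = ["atomic", "atomic", "split", "atomic", "merge", "atomic"]
model_labels = ["atomic", "split",  "split", "atomic", "atomic", "atomic"]
print(f"Cohen's kappa = {cohens_kappa(human_labels, model_labels):.2f}")  # ~0.33 on this toy data
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is a more conservative measure than simple percent overlap between annotators.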
📝 Abstract
Factuality evaluation of large language model (LLM) outputs requires decomposing text into discrete "atomic" facts. However, existing definitions of atomicity are underspecified: empirical results show high disagreement among annotators, both human and model-based, stemming from unresolved ambiguity in fact decomposition. We present a visual analytics concept for exposing and analyzing annotation inconsistencies in fact extraction. By visualizing semantic alignment, granularity, and referential dependencies, our approach aims to enable systematic inspection of extracted facts and to facilitate convergence through guided revision loops, establishing a more stable foundation for factuality evaluation benchmarks and for LLM evaluation more broadly. An illustrative sketch of the three inconsistency types follows.
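As a rough illustration of the inconsistencies the visualization is meant to surface, the sketch below compares two hypothetical annotators' decompositions of the same sentence using a simple token-overlap alignment. The `align` and `flag_referential` helpers, the overlap threshold, and the example facts are assumptions made for this sketch, not the method described in the paper.

```python
import re

PRONOUNS = {"he", "she", "it", "they", "this", "that", "these", "those"}

def token_set(fact: str) -> set:
    return set(re.findall(r"[a-z']+", fact.lower()))

def align(facts_a, facts_b, threshold=0.5):
    """Greedily pair facts from two annotators by Jaccard token overlap."""
    pairs, unmatched_a, used_b = [], [], set()
    for fa in facts_a:
        scored = [
            (len(token_set(fa) & token_set(fb)) / max(1, len(token_set(fa) | token_set(fb))), j)
            for j, fb in enumerate(facts_b) if j not in used_b
        ]
        best = max(scored, default=(0.0, None))
        if best[0] >= threshold:
            pairs.append((fa, facts_b[best[1]], best[0]))
            used_b.add(best[1])
        else:
            unmatched_a.append(fa)  # semantic misalignment: no close counterpart found
    unmatched_b = [fb for j, fb in enumerate(facts_b) if j not in used_b]
    return pairs, unmatched_a, unmatched_b

def flag_referential(facts):
    """Facts whose meaning hinges on an unresolved pronoun (referential dependency)."""
    return [f for f in facts if token_set(f) & PRONOUNS]

# Hypothetical decompositions of the same source sentence by two annotators.
annotator_a = ["Marie Curie won the Nobel Prize in Physics",
               "Marie Curie won the Nobel Prize in Chemistry"]
annotator_b = ["She won two Nobel Prizes"]  # coarser granularity, unresolved "She"

pairs, only_a, only_b = align(annotator_a, annotator_b)
print("granularity mismatch:", len(annotator_a) != len(annotator_b))
print("unmatched facts:", only_a + only_b)
print("referential dependency:", flag_referential(annotator_a + annotator_b))
```

On this toy input the two decompositions fail to align at all, the fact counts differ, and one fact depends on an unresolved pronoun, which is exactly the kind of per-fact evidence an interactive view would need to expose for guided revision.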