🤖 AI Summary
Large language models (LLMs) frequently generate factually incorrect statements, and conventional holistic fact verification approaches struggle to localize subtle factual errors. To address this, we propose a fine-grained fact verification paradigm: decomposing complex claims into semantically equivalent, independently verifiable sub-claims to improve error localization and make evidence retrieval more transparent. Toward this end, we introduce FactLens, the first benchmark explicitly designed for fine-grained fact verification, comprising human-annotated data, sub-claim generation, and alignment modeling. We further design a multidimensional automated evaluation framework that jointly assesses faithfulness, completeness, and verifiability, ensuring both semantic fidelity and contextual consistency. Our FactLens evaluator achieves high agreement with human judgments (Spearman ρ > 0.85) and systematically reveals, for the first time, how sub-claim characteristics such as length and abstraction level affect verification performance.
📝 Abstract
Large Language Models (LLMs) have shown impressive capability in language generation and understanding, but their tendency to hallucinate and produce factually incorrect information remains a key limitation. To verify LLM-generated content and claims from other sources, traditional verification approaches often rely on holistic models that assign a single factuality label to complex claims, potentially obscuring nuanced errors. In this paper, we advocate for a shift toward fine-grained verification, where complex claims are broken down into smaller sub-claims for individual verification, allowing for more precise identification of inaccuracies, improved transparency, and reduced ambiguity in evidence retrieval. However, generating sub-claims poses challenges, such as maintaining context and ensuring semantic equivalence with respect to the original claim. We introduce FactLens, a benchmark for evaluating fine-grained fact verification, with metrics and automated evaluators of sub-claim quality. The benchmark data is manually curated to ensure high-quality ground truth. Our results show alignment between automated FactLens evaluators and human judgments, and we discuss the impact of sub-claim characteristics on the overall verification performance.
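To make the contrast with holistic verification concrete, here is a minimal, hypothetical sketch of the fine-grained paradigm the abstract describes: a compound claim is split into independently verifiable sub-claims, each gets its own verdict, and failures are localized rather than hidden behind one overall label. The example claim, decomposition, and verdicts are hand-written stand-ins, not the paper's data or implementation.

```python
# Hypothetical sketch of fine-grained fact verification: decompose a compound
# claim into sub-claims, verify each independently, and report which sub-claims
# fail. In practice the decomposition and verdicts would come from learned
# components; here they are hard-coded for illustration.

from dataclasses import dataclass


@dataclass
class SubClaim:
    text: str
    verdict: bool  # True = supported by evidence, False = refuted


def verify_fine_grained(sub_claims: list[SubClaim]) -> tuple[bool, list[str]]:
    """Return the overall factuality label plus the texts of the failing
    sub-claims, so an error can be localized to a specific statement
    instead of being obscured by a single holistic label."""
    failures = [sc.text for sc in sub_claims if not sc.verdict]
    return len(failures) == 0, failures


# A compound claim with one true and one false component.
claim = "Marie Curie won two Nobel Prizes and was born in Paris."
sub_claims = [
    SubClaim("Marie Curie won two Nobel Prizes.", True),
    SubClaim("Marie Curie was born in Paris.", False),  # she was born in Warsaw
]

overall, failures = verify_fine_grained(sub_claims)
print(overall)   # False
print(failures)  # ['Marie Curie was born in Paris.']
```

A holistic verifier would only output `False` for the whole claim; the fine-grained decomposition additionally pinpoints which sub-claim is wrong, which is the error-localization benefit the paper argues for.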