VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts

📅 2025-05-14
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Evaluating factual accuracy in long-form text generated by large language models (LLMs) remains challenging due to complex cross-sentence factual dependencies and the inability of existing methods to adequately model contextualized relational facts. Method: an end-to-end framework for fact extraction and verification, featuring: (1) FactRBench, the first benchmark balancing precision and recall for factual evaluation; (2) a context-aware fact completion mechanism that moves beyond the conventional decompose-and-decontextualize paradigm; and (3) integrated LLM-driven fine-grained fact extraction, relational fact modeling, multi-source reference alignment (leveraging SOTA models and human annotations), and a dual-dimensional (precision/recall) evaluation protocol. Results: Experiments demonstrate substantial improvements in factual completeness and relational fidelity. Notably, high precision does not always correlate with high recall across models, underscoring the necessity of comprehensive, multi-dimensional factual assessment.


๐Ÿ“ Abstract
Large language models (LLMs) excel at generating long-form responses, but evaluating their factuality remains challenging due to complex inter-sentence dependencies within the generated facts. Prior solutions predominantly follow a decompose-decontextualize-verify pipeline but often fail to capture essential context and miss key relational facts. In this paper, we introduce VeriFact, a factuality evaluation framework designed to enhance fact extraction by identifying and resolving incomplete and missing facts to support more accurate verification results. Moreover, we introduce FactRBench, a benchmark that evaluates both precision and recall in long-form model responses, whereas prior work primarily focuses on precision. FactRBench provides reference fact sets from advanced LLMs and human-written answers, enabling recall assessment. Empirical evaluations show that VeriFact significantly enhances fact completeness and preserves complex facts with critical relational information, resulting in more accurate factuality evaluation. Benchmarking various open- and closed-weight LLMs on FactRBench indicates that larger models within the same model family improve precision and recall, but high precision does not always correlate with high recall, underscoring the importance of comprehensive factuality assessment.
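The dual-dimensional evaluation described above can be sketched as fact-level precision (the fraction of extracted facts supported by evidence) and fact-level recall (the fraction of reference facts covered by the response). Below is a minimal illustrative sketch, not the paper's implementation: simple set membership stands in for LLM-based verification, and all function and variable names are hypothetical.

```python
def factual_precision(extracted_facts, is_supported):
    """Fraction of facts extracted from a response that are supported by evidence."""
    if not extracted_facts:
        return 0.0
    return sum(1 for f in extracted_facts if is_supported(f)) / len(extracted_facts)

def factual_recall(reference_facts, is_covered):
    """Fraction of reference facts that the response covers."""
    if not reference_facts:
        return 0.0
    return sum(1 for f in reference_facts if is_covered(f)) / len(reference_facts)

# Toy usage: exact string membership stands in for a verifier model.
response_facts = ["Paris is in France",
                  "Paris hosted the 1900 Olympics",
                  "Paris is in Spain"]
evidence = {"Paris is in France", "Paris hosted the 1900 Olympics"}
reference = ["Paris is in France", "Paris is the capital of France"]

precision = factual_precision(response_facts, lambda f: f in evidence)  # 2/3
recall = factual_recall(reference, lambda f: f in set(response_facts))  # 1/2
```

In practice, both `is_supported` and `is_covered` would be entailment-style judgments rather than string matches, which is where context-aware extraction matters: a decontextualized fragment like "It hosted the 1900 Olympics" cannot be verified without restoring its subject.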
Problem

Research questions and friction points this paper is trying to address.

Evaluating factuality in long-form LLM responses is challenging due to complex inter-sentence dependencies among generated facts
Existing decompose-decontextualize-verify pipelines fail to capture essential context and miss key relational facts
Prior benchmarks focus on precision only, leaving recall of reference facts unmeasured
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-aware fact extraction that identifies and resolves incomplete and missing facts
FactRBench: a benchmark evaluating both precision and recall, with reference fact sets from advanced LLMs and human-written answers
Preserves complex facts with critical relational information for more accurate verification