FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation

📅 2024-10-29
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the factual unreliability of large language models (LLMs), in particular hallucination, in realistic human-AI interaction scenarios, and proposes a factuality evaluation paradigm suited to such dynamic settings. Methodologically, it introduces VERIFY, an assessment pipeline that issues a ternary judgment (supported / unsupported / undecidable) on each unit of a model's output based on Web-retrieved evidence. It further presents FactBench, a dynamically updatable benchmark of 1K hallucination-inducing prompts spanning 150 fine-grained topics. Key contributions include: (i) a formal definition of, and a systematic mining procedure for, "hallucination prompts", i.e., prompts that elicit the highest rates of unsupported and undecidable responses; and (ii) a fine-grained, scalable factuality evaluation framework whose judgments correlate with human evaluation better than existing methods. Empirical findings: proprietary models generally outperform open-source counterparts, but their performance degrades from Easy to Hard hallucination prompts; Llama3.1-405B-Instruct exhibits higher subjectivity, which raises the share of content labeled undecidable and leaves its factual precision comparable to or below that of Llama3.1-70B-Instruct; and Gemini1.5-Pro shows a markedly higher refusal rate, over-refusing in 25% of cases.
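To make the pipeline concrete, below is a minimal sketch of a VERIFY-style judgment loop, assuming each response is split into verifiable content units that are judged against retrieved evidence. The helper names (extract_units, retrieve_evidence, judge) and the precision aggregate are hypothetical stand-ins, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Verdict labels used by a VERIFY-style pipeline (terminology from the paper).
SUPPORTED, UNSUPPORTED, UNDECIDABLE = "supported", "unsupported", "undecidable"

@dataclass
class Judgment:
    unit: str            # one verifiable content unit extracted from the response
    verdict: str         # supported / unsupported / undecidable
    evidence: list[str]  # Web snippets the verdict was based on

def verify_response(
    response: str,
    extract_units: Callable[[str], Iterable[str]],  # hypothetical: LM-based unit extraction
    retrieve_evidence: Callable[[str], list[str]],  # hypothetical: Web search per unit
    judge: Callable[[str, list[str]], str],         # hypothetical: LM verdict given evidence
) -> list[Judgment]:
    """Decompose a response into content units and issue a ternary verdict per unit."""
    judgments = []
    for unit in extract_units(response):
        evidence = retrieve_evidence(unit)
        # With no usable evidence, a unit cannot be verified either way.
        verdict = judge(unit, evidence) if evidence else UNDECIDABLE
        judgments.append(Judgment(unit, verdict, evidence))
    return judgments

def factual_precision(judgments: list[Judgment]) -> float:
    """One plausible aggregate: fraction of decidable units that are supported."""
    decidable = [j for j in judgments if j.verdict != UNDECIDABLE]
    if not decidable:
        return 0.0
    return sum(j.verdict == SUPPORTED for j in decidable) / len(decidable)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    demo = verify_response(
        "Paris is the capital of France. Paris has 40 million residents.",
        extract_units=lambda r: [s.strip() for s in r.split(".") if s.strip()],
        retrieve_evidence=lambda u: ["(web snippet)"],
        judge=lambda u, ev: SUPPORTED if "capital" in u else UNSUPPORTED,
    )
    for j in demo:
        print(f"{j.verdict:12s} | {j.unit}")
    print("factual precision:", factual_precision(demo))
```

Treating "no evidence retrieved" as undecidable rather than unsupported mirrors the paper's distinction between incorrect and inconclusive content.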

📝 Abstract
The rapid adoption of language models (LMs) across diverse applications has raised concerns about their factuality, i.e., their consistency with real-world facts. We first present VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation), a pipeline to evaluate LMs' factuality in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable based on Web-retrieved evidence. Importantly, factuality judgment by VERIFY correlates better with human evaluations than existing methods. Using VERIFY, we identify "hallucination prompts" across diverse topics, i.e., those eliciting the highest rates of incorrect (unsupported) and inconclusive (undecidable) LM responses. These prompts form FACTBENCH, a dataset of 1K prompts across 150 fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and can be regularly updated with new prompts. We benchmark widely-used LMs from GPT, Gemini, and Llama families on FACTBENCH, yielding the following key findings: (i) Proprietary models exhibit better factuality, with decreased performance from Easy to Hard hallucination prompts. (ii) Llama3.1-405B-Instruct shows comparable or lower factual precision than Llama3.1-70B-Instruct across all evaluation methods due to its higher subjectivity that leads to more content labeled as undecidable. (iii) Gemini1.5-Pro shows a significantly higher refusal rate, with over-refusal in 25% of cases.
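The abstract's definition of hallucination prompts, those eliciting the highest rates of unsupported and undecidable responses, suggests a simple ranking scheme. The sketch below is illustrative only: the scoring formula, function names, and toy data are assumptions, not the paper's actual mining procedure.

```python
from collections import Counter

def hallucination_score(verdicts: list[str]) -> float:
    """Rate of unsupported + undecidable units across a prompt's responses."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["unsupported"] + counts["undecidable"]) / total

def rank_prompts(prompt_verdicts: dict[str, list[str]], top_k: int = 1000):
    """Keep the top-k prompts that elicit the most unverifiable content."""
    return sorted(
        prompt_verdicts.items(),
        key=lambda item: hallucination_score(item[1]),
        reverse=True,
    )[:top_k]

# Toy example: two prompts with per-unit verdicts from a VERIFY-style judge.
prompts = {
    "Who won the 1987 Tour de France?": ["supported", "supported"],
    "List every lab notebook entry Tesla wrote in 1899.": ["unsupported", "undecidable"],
}
for prompt, verdicts in rank_prompts(prompts, top_k=2):
    print(f"{hallucination_score(verdicts):.2f}  {prompt}")
```

Ranking on this score and keeping the top 1K prompts would yield a FACTBENCH-sized set; how the paper then buckets prompts into Easy/Hard tiers is not specified here.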
Problem

Research questions and friction points this paper is trying to address.

Natural Language Processing
Factuality Assessment
Human Interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

VERIFY System
FACTBENCH
Model Accuracy Assessment