What Do Claim Verification Datasets Actually Test? A Reasoning Trace Analysis

📅 2026-04-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the lack of a systematic understanding of which reasoning capabilities existing fact-checking benchmarks actually evaluate. The authors propose the first decomposition of reasoning components in fact-checking, generating 24K structured reasoning traces using GPT-4o-mini and employing a lightweight 1B-parameter verifier to classify and analyze errors. Their analysis reveals that prevailing benchmarks predominantly assess direct evidence extraction, while multi-sentence synthesis and numerical reasoning remain severely underrepresented; high model performance therefore largely reflects retrieval and textual entailment abilities rather than complex reasoning. Distinct error patterns emerge across domains: general-domain errors stem from lexical overlap bias, scientific-domain errors arise from excessive caution, and mathematical-domain failures result from inadequate arithmetic reasoning. These findings expose critical limitations in current evaluation protocols and offer concrete directions for developing more challenging and comprehensive fact-checking benchmarks.
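The trace-generation step described above can be sketched as a simple prompting loop. The snippet below is a minimal illustration, not the authors' actual pipeline: only the choice of GPT-4o-mini comes from the paper, while the prompt wording, JSON schema, and example claim are assumptions.

```python
# Minimal sketch: generating one structured reasoning trace with GPT-4o-mini.
# The prompt and output schema are illustrative assumptions, not the paper's setup.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TRACE_PROMPT = """Claim: {claim}
Evidence: {evidence}

Produce a step-by-step reasoning trace as JSON with the fields:
"steps" (list of reasoning steps), "reasoning_type" (e.g. direct evidence
extraction, multi-sentence synthesis, numerical reasoning), and
"verdict" (SUPPORTED, REFUTED, or NOT ENOUGH INFO)."""

def generate_trace(claim: str, evidence: str) -> dict:
    """Request a structured reasoning trace for a single claim-evidence pair."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": TRACE_PROMPT.format(claim=claim, evidence=evidence)}],
        response_format={"type": "json_object"},  # ask for parseable JSON
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    trace = generate_trace(
        claim="The Eiffel Tower is taller than 300 metres.",
        evidence="The Eiffel Tower is 330 metres tall.",
    )
    print(trace["reasoning_type"], trace["verdict"])
```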
📝 Abstract
Despite rapid progress in claim verification, we lack a systematic understanding of what reasoning existing benchmarks actually exercise. We generate structured reasoning traces for 24K claim-verification examples across 9 datasets using GPT-4o-mini and find that direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented. A dataset-level breakdown reveals stark biases: some datasets almost exclusively test lexical matching, while others require information synthesis in roughly half of cases. Using a compact 1B-parameter reasoning verifier, we further characterize five error types and show that error profiles vary dramatically by domain: general-domain verification is dominated by lexical overlap bias, scientific verification by overcautiousness, and mathematical verification by arithmetic reasoning failures. Our findings suggest that high benchmark scores primarily reflect retrieval-plus-entailment ability. We outline recommendations for building more challenging evaluation suites that better test the reasoning capabilities verification systems need.
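The error-profiling step applies a compact 1B-parameter verifier to label failed traces. Below is a hedged sketch of how such a classifier could be applied; the checkpoint name, label set, and input format are placeholders, since the paper's exact five-type taxonomy and verifier are not reproduced in this summary.

```python
# Hedged sketch: labelling reasoning-trace errors with a small fine-tuned classifier.
# "my-org/trace-error-verifier-1b" is a hypothetical checkpoint; the labels below are
# illustrative, drawn from the domain-level findings reported above, not the full
# five-type taxonomy.
from transformers import pipeline

EXAMPLE_LABELS = ["lexical_overlap_bias", "overcautiousness", "arithmetic_failure"]

classifier = pipeline(
    "text-classification",
    model="my-org/trace-error-verifier-1b",  # hypothetical 1B verifier checkpoint
)

def profile_error(claim: str, trace_text: str) -> str:
    """Return the predicted error type for a (claim, reasoning trace) pair."""
    result = classifier(f"Claim: {claim}\nTrace: {trace_text}")
    return result[0]["label"]

if __name__ == "__main__":
    print(profile_error(
        "The study enrolled 1,200 patients.",
        "The evidence mentions roughly one thousand participants, so the claim is refuted.",
    ))
```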
Problem

Research questions and friction points this paper is trying to address.

claim verification
reasoning traces
dataset bias
evaluation benchmarks
reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning trace analysis
claim verification
dataset bias
error profiling
evidence synthesis