🤖 AI Summary
This study addresses the lack of a systematic understanding of which reasoning capabilities existing fact-checking benchmarks actually evaluate. The authors propose the first decomposition of the reasoning components involved in fact-checking, generating 24K structured reasoning traces with GPT-4o-mini and employing a lightweight 1B-parameter verifier to classify and analyze errors. Their analysis reveals that prevailing benchmarks predominantly assess direct evidence extraction, while multi-sentence synthesis and numerical reasoning remain severely underrepresented; high model performance therefore largely reflects retrieval and textual entailment ability rather than complex reasoning. Distinct error patterns emerge across domains: general-domain errors stem from lexical overlap bias, scientific-domain errors arise from excessive caution, and mathematical-domain failures result from inadequate arithmetic reasoning. These findings expose critical limitations in current evaluation protocols and offer concrete directions for developing more challenging and comprehensive fact-checking benchmarks.
📝 Abstract
Despite rapid progress in claim verification, we lack a systematic understanding of what reasoning existing benchmarks actually exercise. We generate structured reasoning traces for 24K claim-verification examples across 9 datasets using GPT-4o-mini and find that direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented. A dataset-level breakdown reveals stark biases: some datasets almost exclusively test lexical matching, while others require information synthesis in roughly half of cases. Using a compact 1B-parameter reasoning verifier, we further characterize five error types and show that error profiles vary dramatically by domain: general-domain verification is dominated by lexical overlap bias, scientific verification by overcautiousness, and mathematical verification by arithmetic reasoning failures. Our findings suggest that high benchmark scores primarily reflect retrieval-plus-entailment ability. We outline recommendations for building more challenging evaluation suites that better test the reasoning capabilities verification systems need.