C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that current large language models (LLMs), when used as evaluators of chain-of-thought (CoT) reasoning, struggle to reliably assess the faithfulness of the reasoning process, particularly its causality and coverage. To tackle this, the authors propose C2-Faith, a benchmark that systematically disentangles these two dimensions of faithfulness. Built on the PRM800K dataset, C2-Faith applies controlled perturbations to generate samples with known causal errors or deleted critical steps, yielding a quantifiable and controllable evaluation framework. Experiments on three fine-grained tasks (causal error detection, error localization, and coverage scoring) reveal significant inconsistencies across frontier LLM judges: no model consistently excels on all tasks, error localization lags well behind detection, and coverage scores systematically overrate incomplete reasoning.

📝 Abstract
Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation.
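The two perturbation types described in the abstract (replacing one step with an acausal variant at a known position, and deleting a fixed fraction of steps) can be sketched as follows. This is a minimal illustration, not the authors' released code; the function names, the `acausal_step` placeholder, and the deletion-rate rounding rule are all assumptions for the sake of the example.

```python
import random

def make_causal_perturbation(steps, acausal_step, rng):
    """Replace one randomly chosen step with an acausal variant.
    Returns the perturbed chain and the gold error position
    (the label a judge should recover in the localization task)."""
    pos = rng.randrange(len(steps))
    perturbed = list(steps)
    perturbed[pos] = acausal_step
    return perturbed, pos

def make_coverage_perturbation(steps, deletion_rate, rng):
    """Delete a fixed fraction of steps (at least one).
    Returns the shortened chain and the indices of the deleted
    steps, which serve as the reference labels for coverage scoring."""
    n_delete = max(1, round(deletion_rate * len(steps)))
    deleted = set(rng.sample(range(len(steps)), n_delete))
    kept = [s for i, s in enumerate(steps) if i not in deleted]
    return kept, sorted(deleted)

# Toy usage with a fixed seed for reproducibility.
rng = random.Random(0)
chain = ["step A", "step B", "step C", "step D"]
bad_chain, err_pos = make_causal_perturbation(chain, "<non sequitur>", rng)
short_chain, missing = make_coverage_perturbation(chain, 0.5, rng)
```

A judge is then asked, per task, whether `bad_chain` contains a causal error (detection), where (`err_pos`, localization), or how complete `short_chain` is relative to `chain` (coverage, scored against `missing`).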
Problem

Research questions and friction points this paper is trying to address.

faithfulness
chain-of-thought
causality
coverage
LLM judges
Innovation

Methods, ideas, or system contributions that make the work stand out.

C2-Faith
causal faithfulness
coverage faithfulness
LLM judges
chain-of-thought reasoning