🤖 AI Summary
Large language models (LLMs) exhibit significant deficiencies in formal counterfactual reasoning, yet no rigorous benchmark exists to assess this capability. Method: We introduce CounterBench, the first causal-rule-based counterfactual reasoning benchmark (1K instances), covering diverse causal graph structures and semantic confounding variants. We propose CoIn, a novel inference paradigm that integrates iterative reasoning, backtracking search, and prompt-guided exploration of the counterfactual solution space. The evaluation framework is grounded in formal causal rules, and both the dataset and evaluation code are publicly released. Contribution/Results: Experiments reveal that state-of-the-art LLMs perform near chance on CounterBench (~20% accuracy), while CoIn boosts their performance by an average of +35.2%, with consistent gains across models. This work establishes a principled benchmark and methodology for quantitatively evaluating and controllably enhancing causal reasoning in LLMs.
📝 Abstract
Counterfactual reasoning is widely recognized as one of the most challenging and intricate aspects of causality in artificial intelligence. In this paper, we evaluate the performance of large language models (LLMs) in counterfactual reasoning. In contrast to previous studies that primarily focus on commonsense causal reasoning, where LLMs often rely on prior knowledge for inference, we specifically assess their ability to perform counterfactual inference using a set of formal rules. To support this evaluation, we introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions. The dataset is designed with varying levels of difficulty, diverse causal graph structures, distinct types of counterfactual questions, and multiple nonsensical name variants. Our experiments demonstrate that counterfactual reasoning poses a significant challenge for LLMs, with most models performing at levels comparable to random guessing. To enhance LLMs' counterfactual reasoning ability, we propose a novel reasoning paradigm, CoIn, which guides LLMs through iterative reasoning and backtracking to systematically explore counterfactual solutions. Experimental results show that our method significantly improves LLM performance on counterfactual reasoning tasks and consistently enhances performance across different LLMs. Our dataset is available at https://huggingface.co/datasets/CounterBench/CounterBench.
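For context on the kind of rule-based counterfactual inference the abstract refers to (this is background, not the paper's own dataset or method): over a structural causal model, a counterfactual query is classically answered in three steps, abduction, action, and prediction. A minimal sketch with a hypothetical two-variable model, where the structural equations are assumptions chosen for illustration:

```python
# Hypothetical structural causal model (illustrative assumption, not from the paper):
#   X := U_x
#   Y := X XOR U_y
# Counterfactual query: having observed X=1, Y=0, what would Y have been had X been 0?

def abduction(x_obs: int, y_obs: int) -> tuple[int, int]:
    """Step 1 (abduction): infer the exogenous noise consistent with the observation."""
    u_x = x_obs            # from X := U_x
    u_y = x_obs ^ y_obs    # from Y := X XOR U_y, so U_y = X XOR Y
    return u_x, u_y

def counterfactual_y(u_y: int, x_do: int) -> int:
    """Steps 2-3 (action + prediction): intervene do(X=x_do), recompute Y with inferred noise."""
    return x_do ^ u_y

u_x, u_y = abduction(x_obs=1, y_obs=0)
print(counterfactual_y(u_y, x_do=0))  # → 1 : Y would have been 1 under do(X=0)
```

CounterBench poses questions of this shape over varied causal graphs, so answering them requires applying such formal rules rather than recalling commonsense associations.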