Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework

📅 2025-08-29
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Automated peer-review systems—particularly large language models (LLMs)—struggle to detect logical inconsistencies among results, interpretations, and claims in scientific papers, undermining review reliability. Method: We introduce the first controllable counterfactual evaluation framework for detecting research-logic flaws, systematically generating counterfactually corrupted samples with injected logical errors to quantitatively assess the consistency-aware reasoning capabilities of state-of-the-art LLM-based review methods. Contribution/Results: Experiments reveal that current automated review approaches exhibit no significant sensitivity to logical inconsistencies, exposing critical robustness deficits in their reasoning. To address this, we propose three practical enhancements: (1) explicit logical structure modeling, (2) counterfactual contrastive prompting, and (3) multi-hop verification. We publicly release a counterfactual dataset and an end-to-end evaluation pipeline—the first standardized benchmark for logical consistency assessment in automated scientific review.
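To make the evaluation concrete, the following Python is a minimal sketch of the counterfactual loop described above: corrupt a paper's research logic, review both versions with the same automatic review generator (ARG), and measure how much the score moves. The helpers `generate_review` and `extract_score`, and the string-level corruption in `inject_logic_flaw`, are hypothetical placeholders for illustration, not the paper's released pipeline.

```python
import statistics

def inject_logic_flaw(paper_text: str) -> str:
    """Hypothetical corruption: flip the direction of an interpretation so
    it contradicts the reported results. The released dataset presumably
    injects flaws more systematically; this is only an illustration."""
    return paper_text.replace("improves", "degrades")

def flaw_sensitivity(papers, generate_review, extract_score) -> float:
    """Review each paper and its corrupted twin with the same ARG, then
    return the mean score drop between the clean and corrupted versions."""
    deltas = []
    for paper in papers:
        clean = extract_score(generate_review(paper))
        corrupted = extract_score(generate_review(inject_logic_flaw(paper)))
        deltas.append(clean - corrupted)
    return statistics.mean(deltas)
```

Under this framing, the paper's central finding corresponds to `flaw_sensitivity` returning a value statistically indistinguishable from zero: the injected flaws leave the reviews unchanged.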

📝 Abstract
Large Language Models (LLMs) have great potential to accelerate and support scholarly peer review and are increasingly used as fully automatic review generators (ARGs). However, potential biases and systematic errors may pose significant risks to scientific integrity; understanding the specific capabilities and limitations of state-of-the-art ARGs is essential. We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions. Testing a range of ARG approaches, we find that, contrary to expectation, flaws in research logic have no significant effect on their output reviews. Based on our findings, we derive three actionable recommendations for future work and release our counterfactual dataset and evaluation framework publicly.
Problem

Research questions and friction points this paper is trying to address.

Evaluating whether automatic reviewers can detect faulty research logic
Testing internal consistency between a paper's results, interpretations, and claims
Identifying robustness limitations in automatic review generation (ARG) systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Controllable counterfactual framework for testing flaw detection
Evaluates the internal consistency of research logic
Fully automated pipeline isolates injected flaws under controlled conditions
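Of the three enhancements proposed in the AI summary above, counterfactual contrastive prompting is the easiest to sketch. The template below is a hedged illustration of what such a prompt might look like; the wording, the three-step structure, and the `build_prompt` helper are assumptions for illustration, not the prompts released with the paper.

```python
CONTRASTIVE_REVIEW_PROMPT = """You are reviewing a scientific paper.

Step 1: List each major claim and the specific result cited to support it.
Step 2: For each claim, write its counterfactual version (what the results
would show if the claim were false).
Step 3: Decide which version, the claim or its counterfactual, the reported
results actually support, and flag every mismatch as a logic flaw.

Paper:
{paper_text}
"""

def build_prompt(paper_text: str) -> str:
    # Fill the template; paper_text is assumed to be the paper as plain text.
    return CONTRASTIVE_REVIEW_PROMPT.format(paper_text=paper_text)
```

The design intent is to force the reviewer to articulate the counterfactual alternative to each claim before judging it, rather than rating the paper holistically.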