Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing logical reasoning benchmarks rely heavily on manually crafted, oversimplified, or unnatural samples, limiting comprehensive evaluation of large language models’ (LLMs) complex reasoning capabilities. Method: We propose SmartyPat—a novel framework featuring (i) SmartyPat-Bench, the first high-quality, fine-grained, and highly diverse natural-language logical fallacy benchmark derived from real Reddit posts; and (ii) a Prolog-based, logic-programming-driven test oracle enabling interpretable, scalable fallacy sample generation—whose quality, after LLM-based refinement, matches human-authored instances. Contribution/Results: Empirical analysis reveals a non-monotonic relationship between reasoning step count and fallacy identification/classification performance; structured reasoning improves accuracy, whereas redundant steps degrade detection efficacy.

📝 Abstract
Large Language Models (LLMs) have achieved significant progress in language understanding and reasoning. Evaluating and analyzing their logical reasoning abilities has therefore become essential. However, existing datasets and benchmarks are often limited to overly simplistic, unnatural, or contextually constrained examples. In response to the growing demand, we introduce SmartyPat-Bench, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. To further scale up the study and address the limitations of manual data collection and labeling - such as fallacy-type imbalance and labor-intensive annotation - we introduce SmartyPat, an automated framework powered by logic programming-based oracles. SmartyPat utilizes Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural-language sentences by LLMs, ensuring precise fallacy representation. Extensive evaluation demonstrates that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. Finally, experiments reveal nuanced insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.
Problem

Research questions and friction points this paper is trying to address.

Evaluating logical reasoning in LLMs with realistic benchmarks
Automating fallacy generation using logic programming and LLMs
Analyzing impact of reasoning steps on fallacy detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SmartyPat-Bench with real-world Reddit posts
Uses Prolog rules for automated fallacy generation
Refines statements with LLMs for natural fluency
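The generation idea described above — instantiate a formal fallacy pattern from simple rules, then hand the raw statement to an LLM for rephrasing — can be sketched as follows. This is a minimal illustration, not the paper's Prolog implementation; the rule pairs and the fallacy template (affirming the consequent) are hypothetical examples.

```python
# Illustrative sketch of a SmartyPat-style pipeline stage: a rule base of
# "if P then Q" pairs is mechanically turned into fallacious statements
# (here: affirming the consequent), which an LLM would later rewrite into
# fluent natural language. All rules below are made-up examples.

RULES = [
    ("it rains", "the ground is wet"),
    ("a number is divisible by four", "the number is even"),
]

def affirm_consequent(antecedent: str, consequent: str) -> str:
    """From 'if P then Q' and 'Q', fallaciously conclude 'P'."""
    return (f"If {antecedent}, then {consequent}. "
            f"We observe that {consequent}. Therefore, {antecedent}.")

# Systematically generate one labeled fallacy sample per rule.
samples = [("affirming_the_consequent", affirm_consequent(p, q))
           for p, q in RULES]

for label, text in samples:
    print(label, "::", text)
```

Because the fallacy label is fixed by the template that produced each statement, the generation step doubles as a test oracle: the downstream LLM refinement can change surface wording without changing the ground-truth label.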