🤖 AI Summary
Large language models (LLMs) exhibit strong performance on simple commonsense reasoning but show systematic deficiencies on complex commonsense tasks, particularly implicit, long-horizon causal reasoning (e.g., tracing the prolonged consequences of an event). To address this gap, we introduce Com², a causal-guided benchmark for complex commonsense reasoning. The benchmark is built by combining causal event graphs with *do*-calculus-style interventions to produce structured, multi-step causal scenarios; an LLM then synthesizes examples via a logic-constrained "slow-thinking" process guided by the modified graphs, and detective narratives ground a harder subset. Empirical evaluation reveals that LLMs are limited in both reasoning depth and breadth, and that post-training and slow-thinking strategies mitigate these limitations. All code and data are publicly released.
📝 Abstract
Large language models (LLMs) have mastered abundant simple and explicit commonsense knowledge through pre-training, enabling them to achieve human-like performance in simple commonsense reasoning. Nevertheless, LLMs struggle to reason with complex and implicit commonsense knowledge that is derived from simpler knowledge (such as understanding the long-term effects of certain events), an aspect that humans tend to focus on more. Existing work targets complex tasks such as math and code, while complex commonsense reasoning remains underexplored due to its uncertainty and lack of structure. To fill this gap and align with real-world concerns, we propose Com², a benchmark focusing on complex commonsense reasoning. We first incorporate causal event graphs to serve as structured complex commonsense. We then apply causal theory (e.g., intervention) to modify the causal event graphs and obtain different scenarios that reflect human concerns. Finally, an LLM is employed to synthesize examples with slow thinking, guided by the logical relationships in the modified causal graphs. Furthermore, we use detective stories to construct a more challenging subset. Experiments show that LLMs struggle in both reasoning depth and breadth, while post-training and slow thinking can alleviate this. The code and data are available at https://github.com/Waste-Wood/Com2.
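To make the intervention step concrete, below is a minimal sketch of graph surgery for a *do*-style intervention on a causal event graph: severing the incoming edges of an event fixes it by intervention rather than by its causes, and traversing the remaining edges traces its long-horizon consequences. The graph class, the event names, and the `do`/`downstream` helpers are illustrative assumptions for exposition, not the actual Com² implementation.

```python
# Minimal sketch of a do-style intervention on a causal event graph.
# Graph representation, event names, and helpers are illustrative
# assumptions, NOT taken from the Com^2 codebase.
from collections import defaultdict, deque


class CausalEventGraph:
    def __init__(self):
        self.parents = defaultdict(set)   # event -> set of direct causes
        self.children = defaultdict(set)  # event -> set of direct effects

    def add_edge(self, cause: str, effect: str) -> None:
        self.parents[effect].add(cause)
        self.children[cause].add(effect)

    def do(self, event: str) -> "CausalEventGraph":
        """Graph surgery for do(event): copy the graph but sever every
        incoming edge of `event`, so it holds by intervention alone."""
        g = CausalEventGraph()
        for cause, effects in self.children.items():
            for effect in effects:
                if effect != event:  # drop edges into the intervened event
                    g.add_edge(cause, effect)
        return g

    def downstream(self, event: str) -> list[str]:
        """Breadth-first traversal collecting long-horizon consequences."""
        seen, order, queue = {event}, [], deque([event])
        while queue:
            node = queue.popleft()
            for effect in sorted(self.children[node]):
                if effect not in seen:
                    seen.add(effect)
                    order.append(effect)
                    queue.append(effect)
        return order


# Usage: intervene on "factory closes" and read off multi-step effects.
g = CausalEventGraph()
g.add_edge("economic downturn", "factory closes")
g.add_edge("factory closes", "workers laid off")
g.add_edge("workers laid off", "local spending drops")
g.add_edge("local spending drops", "shops close")

intervened = g.do("factory closes")
print(intervened.downstream("factory closes"))
# ['workers laid off', 'local spending drops', 'shops close']
```

In this toy setting, the multi-step effect chain recovered after the intervention is the kind of structured scaffold that could guide slow-thinking example synthesis, with each edge supplying one logical step.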