ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a fundamental limitation of large language models (LLMs) in non-monotonic logical reasoning, specifically on logic grid puzzles derived from constraint satisfaction problems (CSPs). To study it, the authors introduce ZebraLogic, an evaluation framework that systematically generates logic puzzles with controllable complexity and assesses model performance quantitatively. Experiments across leading models (Llama, o1, DeepSeek-R1) reveal that LLM accuracy decays exponentially with increasing search-space complexity, and that scaling model parameters or inference-time compute fails to mitigate this degradation, confirming a "curse of complexity." Moreover, mitigation strategies, including Best-of-N sampling, explicit backtracking, and self-verification prompting, yield only marginal, diminishing returns. The findings establish ZebraLogic as a rigorous benchmark for probing logical robustness and provide insight into the inherent constraints of LLMs on structured, non-monotonic reasoning tasks.
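Of the mitigation strategies named above, Best-of-N sampling is the simplest: draw N candidate solutions and keep the one that scores highest under some verifier. A minimal sketch of the idea follows; the `generate` and `score` callables here are illustrative stand-ins, not the paper's actual setup, where generation would be LLM sampling and scoring a constraint checker.

```python
import random

def best_of_n(generate, score, n=8):
    """Sample n candidate answers and return the highest-scoring one."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-in: "generation" is a random assignment of 3 attribute
# slots, and "scoring" counts how many clues (here, target positions)
# the candidate satisfies. More samples raise the chance of a good
# candidate, but only if the scorer can recognize one.
target = [1, 2, 3]
gen = lambda: [random.randint(1, 3) for _ in range(3)]
sc = lambda cand: sum(a == b for a, b in zip(cand, target))
best = best_of_n(gen, sc, n=50)
```

The diminishing returns reported in the summary follow from this structure: as the search space grows, the probability that any of N samples is fully correct shrinks faster than N can be raised.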

📝 Abstract
We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty. Our results reveal a significant decline in accuracy as problem complexity grows -- a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.
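The "controllable and quantifiable complexity" described in the abstract can be made concrete: a logic grid puzzle that assigns M attribute categories across N houses has (N!)^M candidate grids before any clue is applied, since each category is an independent permutation of the houses. The exact parameterization below is an assumption based on this standard puzzle structure, not a reproduction of the paper's code.

```python
from math import factorial

def search_space_size(n_houses: int, n_attributes: int) -> int:
    """Candidate solution grids before any clue is applied.

    Each of the n_attributes categories is an independent permutation
    of the n_houses, giving (n_houses!) ** n_attributes grids.
    """
    return factorial(n_houses) ** n_attributes

# Even a small 4-house, 3-attribute puzzle has 24**3 = 13,824
# candidate grids; growth in either parameter is super-exponential.
print(search_space_size(4, 3))  # -> 13824
```

This is why accuracy is studied as a function of search-space size: modest increases in houses or attributes multiply the space combinatorially, which is the regime where the abstract's "curse of complexity" appears.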
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Logical Reasoning
Complex Mathematical Problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

ZebraLogic
Logical Reasoning Assessment
Enhancement Strategies for AI Models