🤖 AI Summary
This work investigates whether large language models (LLMs) possess genuine logical reasoning capabilities or merely exploit statistical shortcuts, by rigorously evaluating their performance on the NP-complete 3-SAT problem. Method: Leveraging the well-established phase transition phenomenon in random 3-SAT, where instance hardness peaks sharply at a theoretically characterized clause-to-variable threshold, we construct the first reasoning evaluation framework calibrated to computational phase transitions. We generate controllable, difficulty-graded random 3-SAT instances and assess multiple LLMs using chain-of-thought prompting. Contribution/Results: Empirical results reveal a steep accuracy drop for mainstream LLMs beyond the phase transition point, whereas DeepSeek-R1 generalizes markedly more robustly, sustaining comparatively stable reasoning performance even when statistical cues are absent; this suggests it may have acquired symbolic reasoning mechanisms. Our work pioneers the integration of statistical-physics-inspired phase transition theory into LLM reasoning evaluation, establishing a quantifiable and interpretable paradigm for assessing logical reasoning competence.
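For intuition on the setup, here is a minimal Python sketch of the kind of random 3-SAT generator such a framework needs; the function name `random_3sat` and its parameters are illustrative assumptions, not the paper's actual implementation. Each clause samples three distinct variables and negates each with probability 1/2, and difficulty is controlled by the clause-to-variable ratio, whose satisfiability phase transition lies near 4.27.

```python
import random

def random_3sat(n_vars, ratio, seed=None):
    """Sample a random 3-SAT instance with m = round(ratio * n_vars) clauses.

    Clauses are lists of DIMACS-style literals: a positive integer v means
    variable v, a negative integer -v means its negation.
    Illustrative sketch; not the generator used in the paper.
    """
    rng = random.Random(seed)
    m = round(ratio * n_vars)
    clauses = []
    for _ in range(m):
        variables = rng.sample(range(1, n_vars + 1), 3)  # 3 distinct variables
        clauses.append([v if rng.random() < 0.5 else -v for v in variables])
    return clauses

# Instances below, near, and above the phase transition (ratio ~ 4.27)
for alpha in (2.0, 4.27, 6.0):
    inst = random_3sat(n_vars=10, ratio=alpha, seed=0)
    print(f"ratio={alpha}: {len(inst)} clauses, first clause {inst[0]}")
```

Sweeping the ratio from well below to well above the threshold yields under-constrained (almost surely satisfiable), critically constrained (hardest), and over-constrained (almost surely unsatisfiable) instances, which is what allows difficulty to be graded in a controlled way.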
📝 Abstract
Large Language Models (LLMs) have been touted as AI models possessing advanced reasoning abilities. In theory, autoregressive LLMs with Chain-of-Thought (CoT) can perform more serial computations to solve complex reasoning tasks. However, recent studies suggest that, despite this capacity, LLMs do not truly learn to reason but instead fit to statistical features. To study reasoning capabilities in a principled fashion, we adopt a computational-theory perspective and propose an experimental protocol centered on 3-SAT -- the prototypical NP-complete problem lying at the core of logical reasoning and constraint satisfaction tasks. Specifically, we examine the phase transitions in random 3-SAT and characterize the reasoning abilities of state-of-the-art LLMs by varying the inherent hardness of the problem instances. By comparing DeepSeek R1 with other LLMs, our findings reveal two key insights: (1) LLM accuracy drops significantly on harder instances, suggesting that all current models struggle when statistical shortcuts are unavailable; and (2) unlike other LLMs, R1 shows signs of having learned the underlying reasoning. Following a principled experimental protocol, our study moves beyond the benchmark-driven evidence often found in LLM reasoning research. Our findings highlight important gaps and suggest clear directions for future research.
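As a small, hedged illustration of how ground-truth SAT/UNSAT labels for such instances can be obtained at the sizes an LLM can be prompted with (a brute-force check written for this summary, not the paper's evaluation pipeline; a proper SAT solver would be used for larger instances):

```python
from itertools import product

def is_satisfiable(clauses, n_vars):
    """Exhaustive satisfiability check; practical only for small n_vars.

    `clauses` uses the same DIMACS-style literal convention as the
    generator sketch above.
    """
    for bits in product([False, True], repeat=n_vars):
        # An assignment satisfies the formula if every clause contains
        # at least one literal that evaluates to True under it.
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

# (x1 or not x2 or x3) and (not x1 or x2 or not x3) is satisfiable
print(is_satisfiable([[1, -2, 3], [-1, 2, -3]], n_vars=3))  # True
```

Comparing a model's CoT-derived verdict (and, on satisfiable instances, its proposed assignment) against such ground truth is one way the accuracy reported across hardness levels could be measured.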