🤖 AI Summary
This work investigates the capability of large language models (LLMs) to determine the truth value of arbitrary first-order logic (FOL) sentences within the framework of Zermelo–Fraenkel set theory with the Axiom of Choice (ZFC).
Method: We introduce the first complexity-controllable FOL sentence generation method, grounded in formal logic construction and explicit modeling of the ZFC axiomatic system, enabling fully automated, zero-prior-knowledge generation of pure formal-logic evaluation data.
Contribution/Results: Based on this, we establish LogiBench—the first benchmark dedicated to rigorous formal reasoning, supporting zero-shot and few-shot evaluation. Empirical evaluation reveals substantial performance degradation in state-of-the-art LLMs (e.g., DeepSeek-R1, o3-mini) on tasks of medium to high complexity. All datasets, source code, and evaluation results are publicly released, providing a reproducible, extensible infrastructure for advancing research on logical reasoning capabilities in LLMs.
📝 Abstract
We present a method for generating first-order logic statements whose complexity can be controlled along multiple dimensions. We use this method to automatically create several datasets of questions asking whether a given first-order logic statement is true or false in Zermelo–Fraenkel set theory. Resolving these questions requires no knowledge beyond the basic notation of first-order logic and set theory, but it does demand planning and logical reasoning, whose difficulty can be raised arbitrarily by increasing the complexity of the generated statements. Furthermore, we extensively evaluate various large language models, including recent models such as DeepSeek-R1 and OpenAI's o3-mini, on these datasets. All datasets, the code used to generate them, and all evaluation data are publicly available at https://github.com/bkuckuck/logical-skills-of-llms.
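To make the idea of complexity-controllable generation concrete, here is a toy sketch (my own illustration, not the paper's generator): it produces random prenex sentences with ∈ and = atoms and decides their truth by exhaustive quantifier expansion over a tiny universe of four hereditarily finite sets. Truth in full ZFC is undecidable in general; the finite universe is the simplifying assumption that makes this runnable. The two "complexity dials" (`n_quant` and `depth`) stand in for the paper's multiple complexity dimensions.

```python
import random

# Universe: four hereditarily finite sets, so membership between elements
# of the universe is meaningful (unlike a universe of sets of integers).
E = frozenset()           # ∅
S1 = frozenset([E])       # {∅}
S2 = frozenset([S1])      # {{∅}}
S3 = frozenset([E, S1])   # {∅, {∅}}
UNIVERSE = [E, S1, S2, S3]

def gen_matrix(depth, n_vars, rng):
    """Random quantifier-free formula: returns (text, evaluator over an env)."""
    if depth == 0:
        a = f"x{rng.randrange(n_vars)}"
        b = f"x{rng.randrange(n_vars)}"
        if rng.random() < 0.5:
            return f"{a} ∈ {b}", lambda env: env[a] in env[b]
        return f"{a} = {b}", lambda env: env[a] == env[b]
    op = rng.choice(["and", "or", "not"])
    if op == "not":
        s, ev = gen_matrix(depth - 1, n_vars, rng)
        return f"¬({s})", lambda env: not ev(env)
    ls, lev = gen_matrix(depth - 1, n_vars, rng)
    rs, rev = gen_matrix(depth - 1, n_vars, rng)
    if op == "and":
        return f"({ls}) ∧ ({rs})", lambda env: lev(env) and rev(env)
    return f"({ls}) ∨ ({rs})", lambda env: lev(env) or rev(env)

def gen_sentence(n_quant, depth, rng):
    """Two complexity dials: number of quantifiers and matrix depth."""
    matrix, ev = gen_matrix(depth, n_quant, rng)
    quants = [rng.choice("∀∃") for _ in range(n_quant)]

    def rec(i, env):
        if i == n_quant:
            return ev(env)
        agg = all if quants[i] == "∀" else any
        return agg(rec(i + 1, {**env, f"x{i}": u}) for u in UNIVERSE)

    text = "".join(f"{q}x{i}. " for i, q in enumerate(quants)) + matrix
    return text, rec(0, {})

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        sentence, truth = gen_sentence(n_quant=2, depth=2, rng=rng)
        print(truth, sentence)
```

Because both the sentence text and its ground-truth label come out of the same generator, datasets of arbitrary size and difficulty can be labeled automatically, with no human annotation — the property the abstract relies on.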