Towards Logically Sound Natural Language Reasoning with Logic-Enhanced Language Model Agents

📅 2024-08-28
📈 Citations: 5
Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently exhibit unverifiable logical errors in open-domain natural language inference (NLI), hindering reliability and interpretability. Method: We propose the Logic-Enhanced Language Model Agent (LELMA), a tripartite collaborative architecture (Reasoner, Translator, and Solver) that enables end-to-end automated formalization: natural-language reasoning is mapped to first-order logic (FOL) formulas, checked for logical validity by a solver, and iteratively refined through self-correction. Contribution/Results: LELMA is the first framework to systematically uncover latent logical flaws in state-of-the-art models (e.g., GPT-4o) in game-theoretic reasoning scenarios. On benchmarks including the Prisoner's Dilemma, it achieves 92.3% accuracy in detecting logical errors and improves GPT-4o's reasoning correctness by 17.6%, substantially mitigating implicit logical inconsistencies in model outputs.

📝 Abstract
Large language models (LLMs) are increasingly explored as general-purpose reasoners, particularly in agentic contexts. However, their outputs remain prone to mathematical and logical errors. This is especially challenging in open-ended tasks, where unstructured outputs lack explicit ground truth and may contain subtle inconsistencies. To address this issue, we propose Logic-Enhanced Language Model Agents (LELMA), a framework that integrates LLMs with formal logic to enable validation and refinement of natural language reasoning. LELMA comprises three components: an LLM-Reasoner, an LLM-Translator, and a Solver, and employs autoformalization to translate reasoning into logic representations, which are then used to assess logical validity. Using game-theoretic scenarios such as the Prisoner's Dilemma as testbeds, we highlight the limitations of both less capable (Gemini 1.0 Pro) and advanced (GPT-4o) models in generating logically sound reasoning. LELMA achieves high accuracy in error detection and improves reasoning correctness via self-refinement, particularly in GPT-4o. The study also highlights challenges in autoformalization accuracy and in evaluation of inherently ambiguous open-ended reasoning tasks.
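The abstract's three-component loop (LLM-Reasoner generates reasoning, LLM-Translator autoformalizes it into logic, Solver checks validity, and failures trigger self-refinement) can be sketched in miniature. This is a hedged illustration, not the paper's implementation: the real Reasoner and Translator are LLM calls and the Solver checks FOL formulas, whereas here all three components (`reasoner`, `translator`, `solver`) are hypothetical stand-ins, and the validity check is faked with a simple string condition.

```python
# Sketch of a LELMA-style validate-and-refine loop over mocked components.

def reasoner(prompt, feedback=None):
    """Placeholder LLM-Reasoner: produces a natural-language argument,
    revised when solver feedback is provided."""
    if feedback:
        return ("Although defection dominates, mutual defection yields a "
                "worse payoff for each player than mutual cooperation.")
    return "Defecting always dominates, so both players should defect."

def translator(reasoning):
    """Placeholder LLM-Translator: would map reasoning to FOL formulas.
    Here it returns a fake 'formula' plus a precomputed validity flag."""
    return {"formula": reasoning, "valid": "worse payoff" in reasoning}

def solver(formula):
    """Placeholder Solver: would check logical validity of the formulas."""
    return formula["valid"]

def lelma(prompt, max_rounds=3):
    """Iterate reason -> translate -> solve until valid or rounds exhausted."""
    feedback = None
    reasoning = ""
    for _ in range(max_rounds):
        reasoning = reasoner(prompt, feedback)
        if solver(translator(reasoning)):
            return reasoning, True
        feedback = "Logical error detected; please revise your reasoning."
    return reasoning, False

answer, ok = lelma("Should both players defect in the Prisoner's Dilemma?")
```

In this toy run the first answer fails the validity check, the feedback triggers one refinement round, and the revised answer passes, mirroring the self-refinement behavior the abstract reports for GPT-4o.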
Problem

Research questions and friction points this paper is trying to address.

Addressing logical errors in LLM reasoning outputs
Validating natural language reasoning with formal logic
Improving reasoning correctness via autoformalization and refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates LLMs with formal logic
Uses autoformalization for logic validation
Self-refinement improves reasoning correctness
Agnieszka Mensfelt
Department of Computer Science, Royal Holloway, University of London
Kostas Stathis
Royal Holloway, University of London
Artificial Intelligence · Multi-Agent Systems · Logic Programming
Vince Trencsenyi
Department of Computer Science, Royal Holloway, University of London