LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

📅 2026-05-12
🤖 AI Summary
Existing static benchmarks do not adequately assess the robustness of large language models (LLMs) under logically equivalent transformations, which often leads to overestimating their reasoning capabilities. To address this limitation, this work proposes the LGMT framework, which for the first time integrates equivalence relations from first-order logic into metamorphic testing to generate semantically invariant test cases. By checking the consistency of model outputs across these test cases, without requiring ground-truth answers, the framework detects reasoning flaws and enables annotation-free evaluation. Experiments across six prominent LLMs, including runs with advanced prompting strategies such as few-shot chain-of-thought (CoT), show that these models are highly sensitive to both symbol-level and conclusion-level perturbations. Current prompting techniques only partially mitigate these issues, which points to deep-seated vulnerabilities that conventional benchmarks fail to expose.
📝 Abstract
Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which fail to assess robustness under logically equivalent transformations and often overestimate reasoning capability. We propose LGMT (Logic-Grounded Metamorphic Testing), an oracle-free framework that leverages first-order logic (FOL) to evaluate LLM reasoning. By deriving metamorphic relations from formal logical equivalences, LGMT constructs semantically invariant test cases and detects reasoning defects through cross-case consistency checking. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects missed by traditional reference-based evaluations. We further find that models are particularly sensitive to symbol-level and conclusion-level variations, and that advanced prompting such as Few-shot CoT only partially mitigates these issues. These results suggest that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance. LGMT provides a principled and scalable approach for diagnosing reasoning failures.
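To make the idea of a metamorphic relation derived from a logical equivalence concrete, here is a minimal sketch of oracle-free consistency checking. It is not the paper's implementation: the two relations used (contrapositive and consistent symbol renaming), the prompt template, the `Case` representation, and the `brittle_model` stub are all assumptions chosen for illustration. The point is only that a model's answer to a source case should match its answers to logically equivalent follow-up cases, and that a mismatch can be flagged without knowing the correct answer.

```python
from typing import Callable, Dict, List, Tuple

Rule = Tuple[str, str]        # (antecedent, consequent); "~p" means "not p"
Case = Tuple[Rule, str, str]  # (rule, known fact, queried literal)


def negate(lit: str) -> str:
    """Negate a literal, cancelling an existing negation."""
    return lit[1:] if lit.startswith("~") else "~" + lit


def contrapositive(case: Case) -> Case:
    """Rewrite the rule p -> q as the equivalent ~q -> ~p."""
    (antecedent, consequent), fact, query = case
    return ((negate(consequent), negate(antecedent)), fact, query)


def rename(case: Case, mapping: Dict[str, str]) -> Case:
    """Consistently rename symbols everywhere; the logic is unchanged."""
    def ren(lit: str) -> str:
        negated = lit.startswith("~")
        name = lit[1:] if negated else lit
        new_name = mapping.get(name, name)
        return "~" + new_name if negated else new_name

    (antecedent, consequent), fact, query = case
    return ((ren(antecedent), ren(consequent)), ren(fact), ren(query))


def render(case: Case) -> str:
    """Turn a case into a natural-language prompt (hypothetical template)."""
    def say(lit: str) -> str:
        return "not " + lit[1:] if lit.startswith("~") else lit

    (antecedent, consequent), fact, query = case
    return (f"If {say(antecedent)}, then {say(consequent)}. "
            f"Given: {say(fact)}. Does it follow that {say(query)}?")


def inconsistent_follow_ups(model: Callable[[str], str], source: Case) -> List[str]:
    """Return follow-up prompts whose answer differs from the source answer."""
    follow_ups = [contrapositive(source), rename(source, {"p": "x", "q": "y"})]
    source_answer = model(render(source))
    return [render(c) for c in follow_ups if model(render(c)) != source_answer]


def brittle_model(prompt: str) -> str:
    """Stand-in for an LLM that only recognises one surface form (assumption)."""
    return "yes" if prompt.startswith("If p, then q.") else "unknown"


SOURCE: Case = (("p", "q"), "p", "q")
print(inconsistent_follow_ups(brittle_model, SOURCE))
```

In practice `model` would wrap a call to an actual LLM, and the relations would be drawn from a richer set of first-order equivalences than the two shown here.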
Problem

Research questions and friction points this paper is trying to address.

LLMs
logical reasoning
metamorphic testing
reasoning reliability
evaluation robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

metamorphic testing
logical reasoning
large language models
first-order logic
reasoning robustness
Zenghui Zhou
School of Automation Science and Electrical Engineering, Beihang University, Xueyuan Road, Haidian District, Beijing, 100191, China
Man Li
School of Automation Science and Electrical Engineering, Beihang University, Xueyuan Road, Haidian District, Beijing, 100191, China
Xiaoke Fang
School of Automation Science and Electrical Engineering, Beihang University, Xueyuan Road, Haidian District, Beijing, 100191, China
Xinyi Zhou
School of Automation Science and Electrical Engineering, Beihang University, Xueyuan Road, Haidian District, Beijing, 100191, China
Weibin Lin
School of Automation Science and Electrical Engineering, Beihang University, Xueyuan Road, Haidian District, Beijing, 100191, China
Zheng Zheng
Professor, Beihang University
Software reliability, artificial intelligence