LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing static benchmarks do not adequately assess the robustness of large language models (LLMs) under logically equivalent transformations, and as a result often overestimate their reasoning capabilities. To address this limitation, this work proposes the LGMT framework, which for the first time integrates equivalence relations from first-order logic into metamorphic testing to generate semantically invariant test cases. The framework checks whether a model's outputs stay consistent across these equivalent cases, so it can detect reasoning flaws without ground-truth answers or manual annotation. Experiments across six prominent LLMs, including runs with advanced prompting strategies such as few-shot chain-of-thought (CoT), show that these models are highly sensitive to both symbol-level and conclusion-level perturbations. Current prompting techniques only partially mitigate these issues, which points to vulnerabilities that conventional benchmarks fail to expose.

📝 Abstract

Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which fail to assess robustness under logically equivalent transformations and often overestimate reasoning capability. We propose LGMT (Logic-Grounded Metamorphic Testing), an oracle-free framework that leverages first-order logic (FOL) to evaluate LLM reasoning. By deriving metamorphic relations from formal logical equivalences, LGMT constructs semantically invariant test cases and detects reasoning defects through cross-case consistency checking. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects missed by traditional reference-based evaluations. We further find that models are particularly sensitive to symbol-level and conclusion-level variations, and that advanced prompting such as Few-shot CoT only partially mitigates these issues. These results suggest that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance. LGMT provides a principled and scalable approach for diagnosing reasoning failures.
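The idea in the abstract (derive metamorphic relations from logical equivalences, build semantically invariant variants, then flag any variant on which the model's answer changes) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the tuple encoding of formulas, the restriction to propositional logic (where equivalence can be checked by truth table), the three relations chosen, and the hypothetical `ask_model` callback that stands in for an LLM query.

```python
from itertools import product

# Formulas are nested tuples: ("var", "p"), ("not", f), ("and", f, g),
# ("or", f, g), ("implies", f, g). This propositional encoding is an
# illustrative assumption; the paper itself works with first-order logic.

def evaluate(f, env):
    """Evaluate formula f under a truth assignment env (name -> bool)."""
    op = f[0]
    if op == "var":
        return env[f[1]]
    if op == "not":
        return not evaluate(f[1], env)
    if op == "and":
        return evaluate(f[1], env) and evaluate(f[2], env)
    if op == "or":
        return evaluate(f[1], env) or evaluate(f[2], env)
    if op == "implies":
        return (not evaluate(f[1], env)) or evaluate(f[2], env)
    raise ValueError(f"unknown operator: {op}")

def variables(f):
    """Collect the variable names that occur in f."""
    if f[0] == "var":
        return {f[1]}
    return set().union(*(variables(sub) for sub in f[1:]))

def equivalent(f, g):
    """Truth-table check that f and g agree under every assignment."""
    names = sorted(variables(f) | variables(g))
    for values in product([False, True], repeat=len(names)):
        env = dict(zip(names, values))
        if evaluate(f, env) != evaluate(g, env):
            return False
    return True

# Hypothetical metamorphic relations, each based on a standard equivalence.
def contrapose(f):
    # p -> q  ==  (not q) -> (not p)
    if f[0] == "implies":
        return ("implies", ("not", f[2]), ("not", f[1]))
    return f

def double_negate(f):
    # f  ==  not (not f)
    return ("not", ("not", f))

def de_morgan(f):
    # not (a and b) == (not a) or (not b); not (a or b) == (not a) and (not b)
    if f[0] == "not" and f[1][0] == "and":
        return ("or", ("not", f[1][1]), ("not", f[1][2]))
    if f[0] == "not" and f[1][0] == "or":
        return ("and", ("not", f[1][1]), ("not", f[1][2]))
    return f

RELATIONS = [contrapose, double_negate, de_morgan]

def generate_variants(f):
    """Return rewritten forms of f that differ syntactically but not semantically."""
    variants = []
    for relation in RELATIONS:
        candidate = relation(f)
        if candidate != f and equivalent(f, candidate):
            variants.append(candidate)
    return variants

def consistency_violations(f, ask_model):
    """Oracle-free check: list the variants whose answer differs from the answer on f."""
    baseline = ask_model(f)
    return [v for v in generate_variants(f) if ask_model(v) != baseline]
```

As a usage example, calling `consistency_violations(("implies", ("var", "p"), ("var", "q")), ask_model)` with a stand-in `ask_model` that returns the same answer for every input yields an empty list, while a stand-in whose answer depends on surface form (for example, on whether the formula starts with a negation) is flagged on the double-negated variant. No reference answer is needed in either case; only agreement across equivalent inputs is checked.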

Problem

Research questions and friction points this paper is trying to address.

LLMs

logical reasoning

metamorphic testing

reasoning reliability

evaluation robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

metamorphic testing

logical reasoning

large language models