Don't Judge a Book by its Cover: Testing LLMs' Robustness Under Logical Obfuscation

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the marked performance degradation of large language models (LLMs) when confronted with logically equivalent but superficially obfuscated problems, revealing their reliance on surface form rather than deep semantic understanding. To tackle this limitation, the authors propose Logifus, the first structure-preserving logical obfuscation framework, and introduce LogiQAte, a comprehensive benchmark specifically designed to evaluate LLMs’ reasoning robustness under logical equivalence transformations. LogiQAte encompasses four task types: equivalence rewriting, indirect relational chaining, symbolic substitution, and frame-of-reference shifting. Zero-shot evaluations across six state-of-the-art models demonstrate substantial performance drops under obfuscation—e.g., an average 47% decline for GPT-4o—highlighting a fundamental deficiency in current LLMs’ capacity for logically robust reasoning.

📝 Abstract
Tasks such as solving arithmetic equations, evaluating truth tables, and completing syllogisms are handled well by large language models (LLMs) in their standard form, but the models often fail when the same problems are posed in logically equivalent yet obfuscated formats. To study this vulnerability, we introduce Logifus, a structure-preserving logical obfuscation framework, and use it to build LogiQAte, a first-of-its-kind diagnostic benchmark with 1,108 questions across four reasoning tasks: (i) Obfus FOL (first-order logic entailment under equivalence-preserving rewrites), (ii) Obfus Blood Relation (family-graph entailment under indirect relational chains), (iii) Obfus Number Series (pattern induction under symbolic substitutions), and (iv) Obfus Direction Sense (navigation reasoning under altered directions and reference frames). Evaluating six state-of-the-art models across all tasks, we find that obfuscation severely degrades zero-shot performance, with accuracy dropping on average by 47% for GPT-4o, 27% for GPT-5, and 22% for the reasoning model o4-mini. Our findings reveal that current LLMs parse questions without deep understanding, highlighting the urgency of building models that genuinely comprehend and preserve meaning beyond surface form.
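The paper does not spell out Logifus's internals, but the core idea of an equivalence-preserving rewrite (as in the Obfus FOL task) can be sketched in a few lines. The snippet below is a hypothetical illustration, not the authors' implementation: the function names (`obfuscate`, `equivalent`) and the tuple encoding of formulas are assumptions. It rewrites implications into their contrapositives and conjunctions via De Morgan's law, then verifies by truth-table enumeration that truth conditions are unchanged while surface form differs.

```python
from itertools import product

# Propositional formulas as nested tuples:
# ("var", name), ("not", f), ("and", f, g), ("or", f, g), ("imp", f, g)

def obfuscate(f):
    """Apply equivalence-preserving rewrites: implications become
    contrapositives, conjunctions are rewritten via De Morgan."""
    op = f[0]
    if op == "var":
        return f
    if op == "not":
        return ("not", obfuscate(f[1]))
    if op == "imp":   # P -> Q  ==  ~Q -> ~P
        return ("imp", ("not", obfuscate(f[2])), ("not", obfuscate(f[1])))
    if op == "and":   # P & Q  ==  ~(~P | ~Q)
        return ("not", ("or", ("not", obfuscate(f[1])),
                              ("not", obfuscate(f[2]))))
    if op == "or":
        return ("or", obfuscate(f[1]), obfuscate(f[2]))
    raise ValueError(f"unknown operator: {op}")

def evaluate(f, env):
    """Evaluate a formula under a {name: bool} assignment."""
    op = f[0]
    if op == "var":
        return env[f[1]]
    if op == "not":
        return not evaluate(f[1], env)
    if op == "and":
        return evaluate(f[1], env) and evaluate(f[2], env)
    if op == "or":
        return evaluate(f[1], env) or evaluate(f[2], env)
    if op == "imp":
        return (not evaluate(f[1], env)) or evaluate(f[2], env)
    raise ValueError(f"unknown operator: {op}")

def variables(f):
    if f[0] == "var":
        return {f[1]}
    return set().union(*(variables(g) for g in f[1:]))

def equivalent(f, g):
    """Check logical equivalence by exhaustive truth-table enumeration."""
    names = sorted(variables(f) | variables(g))
    return all(
        evaluate(f, dict(zip(names, vals))) == evaluate(g, dict(zip(names, vals)))
        for vals in product([False, True], repeat=len(names))
    )

# (p & q) -> r  becomes  ~r -> ~~(~p | ~q): same meaning, new surface form
original = ("imp", ("and", ("var", "p"), ("var", "q")), ("var", "r"))
rewritten = obfuscate(original)
print(rewritten != original, equivalent(original, rewritten))  # True True
```

A benchmark built this way can pair each original question with its obfuscated twin and measure the accuracy gap, since the rewrite guarantees the correct answer is identical for both.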
Problem

Research questions and friction points this paper is trying to address.

logical obfuscation
large language models
reasoning robustness
semantic understanding
equivalence-preserving transformation
Innovation

Methods, ideas, or system contributions that make the work stand out.

logical obfuscation
robustness evaluation
structure-preserving transformation
diagnostic benchmark
semantic invariance