On the Brittleness of LLMs: A Journey around Set Membership

📅 2025-11-16
🤖 AI Summary
This paper identifies a critical vulnerability of large language models (LLMs): despite their proficiency in complex inference, they exhibit pronounced instability and unpredictability on set membership queries, a foundational deterministic reasoning task with simple, logically unambiguous answers.

Method: The authors introduce a minimalist, task-driven paradigm for large-scale systematic evaluation, varying prompt wording, semantic structure, and element ordering while conducting multidimensional empirical analysis across leading LLMs.

Contribution/Results: The experiments provide the first systematic evidence that LLMs possess a fragmented, surface-level understanding of basic set concepts; their decisions are highly sensitive to superficial syntactic and lexical cues, revealing fundamental deficiencies in deep logical reasoning. The work establishes a reproducible, scalable benchmark for assessing LLM reliability and challenges the misconception that "simple tasks imply simple understanding," advancing foundational research on LLM reasoning robustness.

📝 Abstract
Large language models (LLMs) achieve superhuman performance on complex reasoning tasks, yet often fail on much simpler problems, raising concerns about their reliability and interpretability. We investigate this paradox through a focused study with two key design features: simplicity, to expose basic failure modes, and scale, to enable comprehensive controlled experiments. We focus on set membership queries, among the most fundamental forms of reasoning, using tasks like "Is apple an element of the set {pear, plum, apple, raspberry}?". We conduct a systematic empirical evaluation across prompt phrasing, semantic structure, element ordering, and model choice. Our large-scale analysis reveals that LLM performance on this elementary task is consistently brittle and unpredictable across all dimensions, suggesting that the models' "understanding" of the set concept is fragmented and convoluted at best. Our work demonstrates that the large-scale experiments enabled by the simplicity of the problem allow us to map and analyze the failure modes comprehensively, making this approach a valuable methodology for LLM evaluation in general.
Problem

Research questions and friction points this paper is trying to address.

Investigating LLM brittleness on simple set membership queries
Analyzing performance variations across phrasing and model choices
Mapping failure modes in fundamental reasoning tasks systematically
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic evaluation of set membership queries
Large-scale controlled experiments on LLM brittleness
Mapping failure modes through simplified problem design
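The evaluation paradigm above can be sketched as a small prompt generator that crosses phrasing templates with element orderings for a single membership query. This is a minimal illustrative sketch, not the authors' actual code: the template strings, function name, and answer encoding are all assumptions.

```python
import random

# Hypothetical phrasing templates; the paper varies prompt wording,
# and these three variants are illustrative stand-ins.
TEMPLATES = [
    "Is {x} an element of the set {{{s}}}?",
    "Does the set {{{s}}} contain {x}?",
    "Is {x} in {{{s}}}?",
]

def make_queries(element, members, n_orderings=3, seed=0):
    """Generate prompt variants for one membership query by crossing
    phrasing templates with random element orderings."""
    rng = random.Random(seed)
    queries = []
    for _ in range(n_orderings):
        order = members[:]
        rng.shuffle(order)          # vary element ordering
        s = ", ".join(order)
        for template in TEMPLATES:  # vary prompt phrasing
            queries.append({
                "prompt": template.format(x=element, s=s),
                "gold": "yes" if element in members else "no",
            })
    return queries

# Example from the abstract: apple vs. {pear, plum, apple, raspberry}
qs = make_queries("apple", ["pear", "plum", "apple", "raspberry"])
```

Each generated prompt is logically equivalent, so any disagreement in a model's answers across the variants is direct evidence of the brittleness the paper measures.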
Authors

Lea Hergert
University of Szeged, Hungary
Gábor Berend
University of Szeged, Hungary
Mario Szegedy
Professor of Computer Science, Rutgers University
György Turán
University of Illinois at Chicago, USA
Márk Jelasity
HUN-REN–SZTE Research Group on AI, Hungary