🤖 AI Summary
This paper identifies a critical weakness of large language models (LLMs) on set membership queries, a foundational deterministic reasoning task: despite their proficiency in complex inference, LLMs exhibit pronounced instability and unpredictability when making simple, logically unambiguous membership judgments.
Method: The authors introduce a minimalist, task-driven paradigm for large-scale systematic evaluation, varying prompt wording, semantic structure, and element ordering while conducting a multidimensional empirical analysis across leading LLMs.
Contribution/Results: The experiments provide the first systematic evidence that LLMs possess a fragmented, surface-level understanding of basic set concepts: their decisions are highly sensitive to superficial syntactic and lexical cues, revealing fundamental deficiencies in deep logical reasoning. This work establishes a reproducible, scalable benchmark for assessing LLM reliability and challenges the misconception that “simple tasks imply simple understanding,” thereby advancing foundational research on the robustness of LLM reasoning.
📝 Abstract
Large language models (LLMs) achieve superhuman performance on complex reasoning tasks, yet often fail on much simpler problems, raising concerns about their reliability and interpretability. We investigate this paradox through a focused study with two key design features: simplicity, to expose basic failure modes, and scale, to enable comprehensive controlled experiments. We focus on set membership queries, among the most fundamental forms of reasoning, using tasks like “Is apple an element of the set {pear, plum, apple, raspberry}?”. We conduct a systematic empirical evaluation across prompt phrasing, semantic structure, element ordering, and model choice. Our large-scale analysis reveals that LLM performance on this elementary task is consistently brittle and unpredictable across all dimensions, suggesting that the models’ “understanding” of the set concept is fragmented and convoluted at best. Our work demonstrates that the large-scale experiments enabled by the simplicity of the problem allow us to map and analyze the failure modes comprehensively, making this approach a valuable methodology for LLM evaluation in general.
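The evaluation paradigm described above can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the authors' actual harness: it generates set-membership prompt variants by crossing a handful of phrasing templates (hypothetical wordings chosen here for illustration) with every permutation of the element order, which is the kind of controlled variation the paper's design enables at scale.

```python
import itertools

# Hypothetical phrasing templates; the paper varies prompt wording,
# but these exact wordings are this sketch's assumption.
TEMPLATES = [
    "Is {query} an element of the set {{{elements}}}?",
    "Does the set {{{elements}}} contain {query}?",
    "Is {query} a member of {{{elements}}}?",
]

def make_prompts(query, elements):
    """Return one prompt per (template, element-ordering) combination."""
    prompts = []
    for template in TEMPLATES:
        for perm in itertools.permutations(elements):
            prompts.append(
                template.format(query=query, elements=", ".join(perm))
            )
    return prompts

prompts = make_prompts("apple", ["pear", "plum", "apple", "raspberry"])
# 3 templates x 4! orderings = 72 prompt variants for a single query
print(len(prompts))
```

Each generated prompt has a known ground-truth answer, so scoring a model reduces to checking yes/no responses across all variants; disagreement across logically equivalent variants is exactly the brittleness the paper measures.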