🤖 AI Summary
This work addresses inflated performance estimates in LLM reasoning evaluation caused by benchmark data exposure. To mitigate memorization bias, we introduce the first memorization-resistant linguistic reasoning benchmark. The method combines linguistically grounded templated generation with orthographic obfuscation of writing systems: controlled symbol substitution, applied consistently across scripts, dynamically generates question variants that are semantically equivalent but novel in surface form, which reduces the influence of training-data contamination. Empirical evaluation shows that leading models, including OpenAI o1-preview and DeepSeek R1, exhibit an average 18.7% accuracy drop on obfuscated text compared with the original script, indicating that their reasoning capabilities are substantially overestimated and depend heavily on surface-level textual representations. The framework offers a scalable, interpretable approach to debiased reasoning assessment.
📝 Abstract
Effective evaluation of the reasoning capabilities of large language models (LLMs) is susceptible to overestimation due to data exposure of evaluation benchmarks. We introduce a framework for producing linguistic reasoning problems that reduces the effect of memorisation in model performance estimates and apply this framework to develop LINGOLY-TOO, a challenging evaluation benchmark for linguistic reasoning. By developing orthographic templates, we dynamically obfuscate the writing systems of real languages to generate numerous question variations. These variations preserve the reasoning steps required for each solution while reducing the likelihood of specific problem instances appearing in model training data. Our experiments demonstrate that frontier models, including OpenAI o1-preview and DeepSeek R1, struggle with advanced reasoning. Our analysis also shows that LLMs exhibit noticeable variance in accuracy across permutations of the same problem, and on average perform better on questions appearing in their original orthography. Our findings highlight the opaque nature of response generation in LLMs and provide evidence that prior data exposure contributes to overestimating the reasoning capabilities of frontier models.
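The obfuscation idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `make_substitution` and `obfuscate`, the choice of a seeded random permutation, and the sample grapheme set and word are all assumptions made for illustration. The sketch only shows how a one-to-one grapheme substitution can change the surface form of a problem while leaving its internal structure, and therefore the reasoning needed to solve it, intact.

```python
import random


def make_substitution(graphemes, seed):
    """Build a random one-to-one mapping over a set of graphemes.

    Hypothetical helper: the seed makes each obfuscated variant reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(graphemes)
    rng.shuffle(shuffled)
    return dict(zip(graphemes, shuffled))


def obfuscate(text, mapping):
    """Rewrite text using the mapping, trying the longest graphemes first.

    Characters that are not covered by the mapping are copied unchanged.
    """
    keys = sorted(mapping, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # No grapheme matched at this position: keep the character.
            out.append(text[i])
            i += 1
    return "".join(out)


# Illustrative grapheme set and made-up word, not taken from the benchmark.
consonants = list("ptkmns")
variants = [obfuscate("tama kisu", make_substitution(consonants, seed))
            for seed in range(3)]
```

Because the mapping is a bijection, every variant can be mapped back to the original form, so a gold answer can be obfuscated with the same mapping as its question. Restricting the substitution to a chosen set of graphemes (here, only consonants) is one possible design choice; which symbols may be swapped in the actual benchmark is determined by its orthographic templates.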