🤖 AI Summary
This paper investigates why large language models (LLMs) struggle with cryptic crosswords, a task that demands multi-layered linguistic manipulation: general knowledge, puns, anagrams, hidden words, and other rule-governed forms of wordplay. The authors establish benchmark results for three popular LLMs (Gemma2, LLaMA3, and ChatGPT) under zero-shot and few-shot evaluation, observing a peak accuracy of only 28.3%, substantially below expert human performance. Beyond reporting scores, the paper analyzes the sources of these failures, and the code and introduced datasets are released publicly.
📝 Abstract
Cryptic crosswords are puzzles that rely on general knowledge and the solver's ability to manipulate language on different levels, dealing with various types of wordplay. Previous research suggests that solving such puzzles is challenging even for modern NLP models, including Large Language Models (LLMs). However, there is little to no research on the reasons for their poor performance on this task. In this paper, we establish benchmark results for three popular LLMs: Gemma2, LLaMA3, and ChatGPT, showing that their performance on this task is still significantly below that of humans. We also investigate why these models struggle to achieve superior performance. We release our code and the introduced datasets at https://github.com/bodasadallah/decrypting-crosswords.
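As an illustration of how accuracy on such a benchmark is typically computed (a minimal sketch of normalized exact-match scoring, not the paper's released evaluation code; the function names and sample data are hypothetical), model outputs are usually normalized before being compared against gold answers, since crossword answers ignore case, spaces, and punctuation:

```python
def normalize(answer: str) -> str:
    """Normalize an answer for comparison: uppercase, letters only."""
    return "".join(ch for ch in answer.upper() if ch.isalpha())

def exact_match_accuracy(predictions, gold):
    """Fraction of predictions whose normalized form equals the gold answer."""
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return correct / len(gold) if gold else 0.0

# Hypothetical predictions vs. gold answers:
preds = ["brain", "BRAIN S", "train"]
golds = ["BRAIN", "BRAINS", "BRAIN"]
print(exact_match_accuracy(preds, golds))  # 2 of 3 match after normalization
```

Normalization matters here because LLMs often emit extra whitespace or mixed case around an otherwise correct answer; scoring raw strings would understate their true accuracy.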