AI Summary
This work addresses the tendency of language models to generate plausible yet factually false responses about non-existent concepts, which reveals their limited ability to recognize the boundaries of their knowledge, a critical vulnerability in high-stakes applications. To systematically investigate this issue, we introduce PhantomBench, the first large-scale benchmark comprising over 60,000 human-crafted non-existent terms and entities, generated via a scalable pipeline that derives them from real-world concepts across diverse domains. We evaluate 21 prominent language models on their hallucination behavior and capacity to abstain from answering, finding average hallucination rates as high as 86.7% and limited willingness to abstain, even among state-of-the-art models. PhantomBench thus provides a rigorous tool for probing and advancing research on models' awareness of the limits of their knowledge.
Abstract
Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on them. This is particularly concerning in high-stakes domains, where such model behavior can lead to significant harm. Despite notable progress in understanding hallucinations, it remains unclear how reliably these models can recognize the limits of their knowledge. We introduce PhantomBench, the first large-scale benchmark of its kind, comprising more than 60K non-existent terms and entities derived from real concepts across diverse domains. Using our benchmark, we evaluate a total of 21 models of various types and sizes. We show staggering hallucination rates across the board (with average rates as high as 86.7% in some cases), and note that even frontier models surprisingly fail to abstain on non-existent concepts, especially when the input presumes their existence. We then show that PhantomBench can serve as a proxy for studying model behavior on rare concepts, on which models are more prone to hallucinate. We also provide the pipeline used to construct PhantomBench, enabling scalable generation of non-existent concepts tailored to the specific needs of researchers and practitioners.