PhantomBench: Benchmarking the Non-existential Threat of Language Models

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the tendency of language models to generate plausible but false responses about non-existent concepts, which reveals their limited ability to recognize the boundaries of their knowledge, a critical vulnerability in high-stakes applications. To investigate this systematically, we introduce PhantomBench, the first large-scale benchmark of its kind, comprising more than 60,000 non-existent terms and entities that a scalable pipeline derives from real concepts across diverse domains. We evaluate 21 language models of various types and sizes on their hallucination behavior and their capacity to abstain, finding high hallucination rates across the board (average rates as high as 86.7% in some cases) and a failure to abstain even among frontier models, especially when the input presumes that the concept exists. PhantomBench thus provides a tool for probing how well models recognize the limits of their knowledge, and it can serve as a proxy for studying model behavior on rare concepts.

📝 Abstract

Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to rely on these responses blindly. This is particularly concerning in high-stakes domains, where such model behavior can lead to significant harm. Despite notable progress in understanding hallucinations, it remains unclear how reliably these models can recognize the limits of their knowledge. We introduce PhantomBench, the first large-scale benchmark of its kind, comprising more than 60K non-existent terms and entities derived from real concepts across diverse domains. Using our benchmark, we evaluate a total of 21 models of various types and sizes. We show staggering hallucination rates across the board (with average rates as high as 86.7% in some cases), and note that even frontier models surprisingly fail to abstain on non-existent concepts, especially when the input presumes their existence. We then show that PhantomBench can serve as a proxy for studying model behavior on rare concepts, for which models are more prone to hallucinate. We also provide the pipeline used to construct PhantomBench, enabling scalable generation of non-existent concepts tailored to the specific needs of researchers and practitioners.

Problem

Research questions and friction points this paper is trying to address.

hallucination

language models

non-existent entities

knowledge boundary

benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination

language models

PhantomBench