DECK: A Consistency x Confidence Taxonomy of LLM Hallucinations

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing hallucination taxonomies for large language models: they describe what is wrong with an output but do not indicate which uncertainty scorer could have caught the error. The authors propose DECK, a taxonomy centered on detectability, which defines a 2×2 grid over inter-sample consistency and token-level confidence and maps each of the four cells (Drift, Entrenched, Confabulation, Knotted) to the scorer families that have signal there: black-box consistency scorers, white-box token-probability scorers, and LLM-as-a-Judge. Cell membership is set by a Youden's J optimal split on each scorer axis. Across three models and four datasets, the taxonomy is validated by analysing scorer-pair disagreement and by checking that external labels fall into the predicted cells. The experiments also expose a blind spot shared by all output-level uncertainty methods: they fail on fabricated content that is both highly confident and highly consistent.

📝 Abstract

Existing hallucination taxonomies classify LLM errors by what is wrong with the output -- memorised misconceptions, reasoning failures, fluent fabrications. These taxonomies are useful for diagnosis but cannot answer a different question: which uncertainty scorer would have caught this error? We propose a complementary taxonomy that classifies errors by their detectability signature -- the signal a scorer family would read. The DECK taxonomy is a 2x2 partition along inter-sample consistency and token-level confidence into four behavioural regimes (Drift, Entrenched, Confabulation, Knotted), each mapping to a specific scorer family (or families) that can detect it: black-box consistency scorers have signal in D and C, white-box token-probability scorers have signal in K and C, and only an LLM-as-a-Judge with independent pretraining can detect E. Cell membership is operationalised by a Youden's J optimal split on each scorer axis. Across three models and four datasets we validate the taxonomy two ways: by analysing scorer-pair disagreement, and by checking that external labels (SelfAware unanswerable, HaluEval adversarial, PopQA entity popularity) land in the predicted DECK cells, with model-scale and content-specific secondary-cell refinements. We further identify a universal blind spot of output-level UQ: on knowledge-gap inputs where the generator emits confident, repeatable fabrications, every output-level family collapses by construction. A linear probe on Llama-3-8B's hidden states also collapses to chance, giving preliminary evidence that the failure may persist at the activation level; richer internal-state methods (UQ heads, information-theoretic estimators) remain to be tested.
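The cell assignment described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes both scorers output an uncertainty-style score (higher means more likely to be an error), that Youden's J is computed as J = TPR - FPR over candidate thresholds, and that the cell layout follows from the scorer mapping stated above: a consistency scorer fires in D and C, a token-probability scorer fires in K and C, and neither fires in E. The function names `youden_threshold` and `deck_cell` are hypothetical.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximising J = TPR - FPR.

    scores: uncertainty scores, higher = more likely an error (assumption).
    labels: 1 for an erroneous output, 0 for a correct one.
    An output is flagged when score >= threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")

    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def deck_cell(inconsistency, token_uncertainty, t_cons, t_conf):
    """Assign an erroneous output to a DECK cell (layout inferred from the abstract).

    inconsistency: black-box inter-sample inconsistency score.
    token_uncertainty: white-box token-level uncertainty score.
    t_cons, t_conf: per-axis thresholds, e.g. from youden_threshold.
    """
    cons_fires = inconsistency >= t_cons      # low inter-sample consistency
    conf_fires = token_uncertainty >= t_conf  # low token-level confidence
    if cons_fires and conf_fires:
        return "C"  # Confabulation: both scorer families have signal
    if cons_fires:
        return "D"  # Drift: only the consistency scorer has signal
    if conf_fires:
        return "K"  # Knotted: only the token-probability scorer has signal
    return "E"      # Entrenched: confident and repeatable, neither fires


# Toy data, purely illustrative (not from the paper).
toy_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 1, 1]
threshold, j_stat = youden_threshold(toy_scores, toy_labels)
cell = deck_cell(0.1, 0.1, threshold, threshold)
```

On this toy data the errors and correct outputs are perfectly separable, so the split lands at the lowest-scoring error, and an output scoring below the threshold on both axes falls in E, the cell that the abstract says only an LLM-as-a-Judge with independent pretraining can detect.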

Problem

Research questions and friction points this paper is trying to address.

LLM hallucinations

uncertainty quantification

detectability

taxonomy

output-level UQ

Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination taxonomy

uncertainty quantification

consistency scoring