🤖 AI Summary
Hallucinations in large language models (LLMs) severely undermine their reliability, yet existing evaluation methods (e.g., KnowHalu) incur prohibitive computational overhead. Method: This paper adopts HHEM, a lightweight, LLM-free, standalone classification framework for hallucination assessment. Unlike prior approaches, HHEM requires no LLM self-reflection or generation. Contribution/Results: Its key elements are (1) an end-to-end classification paradigm independent of LLM generation processes; (2) a segment-based retrieval mechanism for fine-grained hallucination detection; and (3) insights, derived from CDF-based statistical analysis and non-fabrication checking, showing that larger (7B–9B) models exhibit the fewest hallucinations while intermediate-sized models are the most unstable. Experiments show HHEM reduces evaluation time from 8 hours to 10 minutes, with non-fabrication checking achieving 82.2% accuracy and a 78.9% true positive rate.
📝 Abstract
Hallucinations in Large Language Models (LLMs) pose a significant challenge, generating misleading or unverifiable content that undermines trust and reliability. Existing evaluation methods, such as KnowHalu, employ multi-stage verification but suffer from high computational costs. To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high detection accuracy. We conduct a comparative analysis of hallucination detection methods across various LLMs, evaluating True Positive Rate (TPR), True Negative Rate (TNR), and Accuracy on question-answering (QA) and summarization tasks. Our results show that HHEM reduces evaluation time from 8 hours to 10 minutes, while HHEM with non-fabrication checking achieves the highest accuracy (82.2%) and TPR (78.9%). However, HHEM struggles with localized hallucinations in summarization tasks. To mitigate this, we introduce segment-based retrieval, which improves detection by verifying smaller text components. Additionally, our cumulative distribution function (CDF) analysis indicates that larger models (7B-9B parameters) generally exhibit fewer hallucinations, while intermediate-sized models show higher instability. These findings highlight the need for structured evaluation frameworks that balance computational efficiency with robust factual validation, enhancing the reliability of LLM-generated content.
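The segment-based retrieval idea described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `consistency_score` is a toy lexical-overlap proxy standing in for an HHEM-style classifier (the real model would return a learned factual-consistency probability), and all function names and the threshold value are assumptions.

```python
import re

def split_segments(text):
    """Split text into sentence-like segments for fine-grained checking."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def consistency_score(source, segment):
    """Toy proxy for an HHEM-style score in [0, 1]: fraction of segment
    tokens that also appear in the source. (Assumption: the real classifier
    returns the probability that `segment` is supported by `source`.)"""
    src_tokens = set(source.lower().split())
    seg_tokens = set(segment.lower().split())
    if not seg_tokens:
        return 1.0
    return len(src_tokens & seg_tokens) / len(seg_tokens)

def detect_hallucinations(source, summary, threshold=0.5):
    """Score each summary segment against the source; flag low scorers.

    Verifying segments individually is what lets localized hallucinations
    surface instead of being averaged away over the whole summary."""
    flagged = []
    for seg in split_segments(summary):
        score = consistency_score(source, seg)
        if score < threshold:
            flagged.append((seg, score))
    return flagged

source = "The cat sat on the mat. It purred quietly."
summary = "The cat sat on the mat. The dog barked loudly."
print(detect_hallucinations(source, summary))  # → [('The dog barked loudly.', 0.25)]
```

The faithful first sentence scores 1.0 and passes, while the fabricated second sentence falls below the threshold and is flagged; a whole-summary score would have blended the two.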
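The CDF analysis mentioned above can be illustrated with a small sketch: build an empirical CDF of per-sample hallucination scores for each model size and compare where the curves rise. The score lists below are made-up illustrative values, not the paper's data.

```python
def empirical_cdf(scores):
    """Return F where F(x) = fraction of observed scores <= x."""
    ordered = sorted(scores)
    n = len(ordered)
    def F(x):
        # Count how many observed scores fall at or below x.
        return sum(1 for s in ordered if s <= x) / n
    return F

# Hypothetical hallucination scores (higher = more hallucinated).
large_model = [0.05, 0.10, 0.12, 0.20]  # 7B-9B: fewer, milder hallucinations
mid_model = [0.30, 0.45, 0.55, 0.70]    # intermediate size: more unstable

F_large = empirical_cdf(large_model)
F_mid = empirical_cdf(mid_model)

# A CDF that reaches 1.0 at low scores indicates consistently low
# hallucination; a slowly rising CDF indicates instability.
print(F_large(0.25), F_mid(0.25))  # → 1.0 0.0
```

Here all of the larger model's scores sit below 0.25 while none of the mid-sized model's do, which is the kind of separation the paper's CDF comparison surfaces.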