Can We Catch the Elephant? A Survey of the Evolvement of Hallucination Evaluation on Natural Language Generation

πŸ“… 2024-04-18
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ€– AI Summary
Hallucination in large language models (LLMs) necessitates reliable, comparable automated hallucination evaluation (AHE) methods; however, existing AHE approaches are fragmented and lack a unified theoretical foundation. Method: We conduct a systematic literature review of 2018–2024 publications, proposing the first three-dimensional analytical framework grounded in hallucination granularity (fact-level), evaluator design principles, and evaluation dimensions. We further perform bibliometric analysis, cross-model comparative evaluation, and systematic meta-analysis. Contribution/Results: Our work reveals the co-evolutionary pattern between AHE paradigms and generative model capabilities, establishes the first structured taxonomy of AHE methods, and identifies critical evaluation blind spots. Collectively, this study provides both theoretical grounding and practical guidance for designing trustworthy natural language generation (NLG) benchmarks and assessing LLM reliability.

πŸ“ Abstract
Hallucination in Natural Language Generation (NLG) is like the elephant in the room: obvious, but often overlooked until recent achievements significantly improved the fluency and grammaticality of generated text. As the capabilities of text generation models have improved, researchers have begun to pay more attention to the phenomenon of hallucination. Despite significant progress in this field in recent years, the evaluation landscape for hallucination remains complex and diverse, lacking clear organization. We are the first to comprehensively survey how various evaluation methods have evolved alongside the development of text generation models, across three dimensions: hallucinated fact granularity, evaluator design principles, and assessment facets. This survey aims to help researchers identify current limitations in hallucination evaluation and highlight future research directions.
Problem

Research questions and friction points this paper is trying to address.

Accurately evaluating hallucinations in Large Language Models (LLMs)
Addressing methodological fragmentation in Automatic Hallucination Evaluation (AHE)
Developing unified evaluation frameworks for pre- and post-LLM methodologies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive analysis of 74 evaluation methods
Unified pipeline for datasets and benchmarks
Integration of enhanced interpretability mechanisms