Can We Catch the Elephant? A Survey of the Evolvement of Hallucination Evaluation on Natural Language Generation

πŸ“… 2024-04-18
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ€– AI Summary
Hallucination in large language models (LLMs) necessitates reliable, comparable automated hallucination evaluation (AHE) methods; however, existing AHE approaches are fragmented and lack a unified theoretical foundation. Method: We conduct a systematic literature review of 2018–2024 publications, proposing the first three-dimensional analytical framework grounded in hallucination granularity (fact-level), evaluator design principles, and evaluation dimensions. We further perform bibliometric analysis, cross-model comparative evaluation, and systematic meta-analysis. Contribution/Results: Our work reveals the co-evolutionary pattern between AHE paradigms and generative model capabilities, establishes the first structured taxonomy of AHE methods, and identifies critical evaluation blind spots. Collectively, this study provides both theoretical grounding and practical guidance for designing trustworthy natural language generation (NLG) benchmarks and assessing LLM reliability.

πŸ“ Abstract
Hallucination in Natural Language Generation (NLG) is like the elephant in the room: obvious, but often overlooked until recent achievements significantly improved the fluency and grammaticality of generated text. As the capabilities of text generation models have improved, researchers have begun to pay more attention to the phenomenon of hallucination. Despite significant progress in this field in recent years, the evaluation landscape for hallucination remains complex and diverse, lacking clear organization. We are the first to comprehensively survey how various evaluation methods have evolved alongside the development of text generation models, across three dimensions: hallucinated fact granularity, evaluator design principles, and assessment facets. This survey aims to help researchers identify current limitations in hallucination evaluation and highlight future research directions.
Problem

Research questions and friction points this paper is trying to address.

Accurately evaluating hallucinations in Large Language Models (LLMs)
Addressing methodological fragmentation in Automatic Hallucination Evaluation (AHE)
Developing unified evaluation frameworks for pre- and post-LLM methodologies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive analysis of 74 evaluation methods
Unified pipeline for datasets and benchmarks
Integration of enhanced interpretability mechanisms