From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization

📅 2024-10-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from pervasive hallucinations in multi-document summarization (MDS), undermining factual consistency and reliability. Method: The authors systematically characterize hallucination patterns and their causes, introducing two topic-oriented MDS hallucination benchmarks built from existing human-annotated news and dialogue datasets. Using these benchmarks, they evaluate five state-of-the-art LLMs. Contribution/Results: On average, up to 75% of the content in LLM-generated summaries is hallucinated, with hallucinations concentrated toward the end of summaries. When asked to summarize non-existent topic-related information, gpt-3.5-turbo and GPT-4o still fabricate summaries 79.35% and 44% of the time, respectively. Manual analysis of 700+ insights identifies instruction-following failures and overly generic insights as the primary error sources, and simple post-hoc mitigation baselines prove only moderately effective. All annotated datasets and evaluation code are publicly released to advance trustworthy MDS research.

📝 Abstract
Although many studies have investigated and reduced hallucinations in large language models (LLMs) for single-document tasks, research on hallucination in multi-document summarization (MDS) tasks remains largely unexplored. Specifically, it is unclear how the challenges arising from handling multiple documents (e.g., repetition and diversity of information) affect model outputs. In this work, we investigate how hallucinations manifest in LLMs when summarizing topic-specific information from multiple documents. Since no benchmarks exist for investigating hallucinations in MDS, we use existing news and conversation datasets, annotated with topic-specific insights, to create two novel multi-document benchmarks. When evaluating 5 LLMs on our benchmarks, we observe that on average, up to 75% of the content in LLM-generated summaries is hallucinated, with hallucinations more likely to occur towards the end of the summaries. Moreover, when summarizing non-existent topic-related information, gpt-3.5-turbo and GPT-4o still generate summaries about 79.35% and 44% of the time, raising concerns about their tendency to fabricate content. To understand the characteristics of these hallucinations, we manually evaluate 700+ insights and find that most errors stem from either failing to follow instructions or producing overly generic insights. Motivated by these observations, we investigate the efficacy of simple post-hoc baselines in mitigating hallucinations but find them only moderately effective. Our results underscore the need for more effective approaches to systematically mitigate hallucinations in MDS. We release our dataset and code at github.com/megagonlabs/Hallucination_MDS.
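The abstract reports both an overall hallucination rate and a positional trend (more hallucinations toward the end of summaries). A minimal sketch of how such statistics could be computed from per-insight annotations is shown below; this is an illustrative reconstruction, not the paper's released evaluation code, and the `hallucination_stats` function and its boolean annotation format are assumptions for the example.

```python
# Illustrative sketch (not the paper's released code): given summaries whose
# insights have been manually annotated as faithful or hallucinated, compute
# the overall hallucination rate and a per-position breakdown, mirroring the
# observation that hallucinations cluster toward the end of summaries.

def hallucination_stats(annotated_summaries):
    """annotated_summaries: list of summaries, each a list of booleans
    (True = the insight at that position is hallucinated)."""
    total = hallucinated = 0
    position_counts = {}  # position -> (hallucinated_count, total_count)
    for summary in annotated_summaries:
        for pos, is_hallucinated in enumerate(summary):
            total += 1
            hallucinated += is_hallucinated
            h, t = position_counts.get(pos, (0, 0))
            position_counts[pos] = (h + is_hallucinated, t + 1)
    overall = hallucinated / total if total else 0.0
    by_position = {pos: h / t for pos, (h, t) in sorted(position_counts.items())}
    return overall, by_position

# Hypothetical example: three annotated summaries of three insights each.
summaries = [
    [False, False, True],  # only the final insight is hallucinated
    [False, True, True],
    [False, False, True],
]
overall, by_position = hallucination_stats(summaries)
```

In this toy example the overall rate is 4/9 while the rate at the final position is 1.0, illustrating the kind of positional skew the paper reports.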
Problem

Research questions and friction points this paper is trying to address.

Study hallucinations in multi-document summarization by LLMs
Create benchmarks for evaluating hallucinations in MDS
Assess LLM tendency to fabricate non-existent topic information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created two multi-document benchmarks for studying hallucination
Evaluated 5 LLMs, revealing high hallucination rates
Investigated simple post-hoc baselines for hallucination mitigation, found only moderately effective