Benchmarking LLMs on the Semantic Overlap Summarization Task

📅 2024-02-26
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses Semantic Overlap Summarization (SOS), a constrained multi-document summarization task that requires precisely extracting the semantics shared between two alternative narratives. We systematically evaluate 15 state-of-the-art large language models (LLMs) on SOS. To this end, we propose the first dedicated LLM evaluation framework for SOS, comprising a curated, open-source benchmark dataset; a structured prompting template system grounded in the TELeR taxonomy; and a multi-dimensional evaluation protocol integrating ROUGE, BERTScore, and SEM-F1, a metric designed to quantify semantic overlap fidelity. Experimental results reveal substantial performance disparities across LLMs and consistent bottlenecks: weak logical consistency and insufficient fine-grained semantic alignment in overlap extraction. Our contributions include a reproducible benchmark, methodological tools, and open resources to advance research on constrained summarization.

πŸ“ Abstract
Semantic Overlap Summarization (SOS) is a constrained multi-document summarization task, where the constraint is to capture the common/overlapping information between two alternative narratives. While recent advancements in Large Language Models (LLMs) have achieved superior performance in numerous summarization tasks, a benchmarking study of the SOS task using LLMs is yet to be performed. As LLMs' responses are sensitive to slight variations in prompt design, a major challenge in conducting such a benchmarking study is to systematically explore a variety of prompts before drawing a reliable conclusion. Fortunately, the recently proposed TELeR taxonomy can be used to design and explore various prompts for LLMs. Using this TELeR taxonomy and 15 popular LLMs, this paper comprehensively evaluates LLMs on the SOS task, assessing their ability to summarize overlapping information from multiple alternative narratives. For evaluation, we report well-established metrics like ROUGE, BERTScore, and SEM-F1 on two different datasets of alternative narratives. We conclude the paper by analyzing the strengths and limitations of various LLMs in terms of their capabilities in capturing overlapping information. The code and datasets used to conduct this study are available at https://anonymous.4open.science/r/llm_eval-E16D.
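To make the evaluation setup concrete, below is a minimal, hedged sketch of a SEM-F1-style sentence-level F1 score. The published SEM-F1 metric uses neural sentence encoders; here a simple token-overlap cosine similarity stands in for those embeddings, so the numbers are illustrative only, not the paper's metric.

```python
# Sketch of a SEM-F1-style metric: sentence-level precision/recall over
# pairwise similarity. Token-overlap cosine similarity is used as a
# stand-in for the neural sentence embeddings the real metric relies on.
from collections import Counter
import math

def _cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two sentences."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    num = sum(ca[t] * cb[t] for t in ca)
    den = (math.sqrt(sum(v * v for v in ca.values()))
           * math.sqrt(sum(v * v for v in cb.values())))
    return num / den if den else 0.0

def sem_f1_sketch(predicted_sents, reference_sents) -> float:
    """F1 over best-match sentence similarities (illustrative only)."""
    # Precision: each predicted sentence scored against its closest reference.
    p = sum(max(_cosine(s, r) for r in reference_sents)
            for s in predicted_sents) / len(predicted_sents)
    # Recall: each reference sentence scored against its closest prediction.
    r = sum(max(_cosine(s, q) for q in predicted_sents)
            for s in reference_sents) / len(reference_sents)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

Swapping `_cosine` for embedding-based similarity (e.g. from a sentence encoder) would bring this sketch closer to the actual metric while keeping the same precision/recall structure.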
Problem

Research questions and friction points this paper is trying to address.

Benchmark LLMs on Semantic Overlap Summarization (SOS) task
Introduce PrivacyPolicyPairs dataset for SOS benchmarks
Evaluate LLM summaries using TELeR and human assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing PrivacyPolicyPairs dataset for SOS
Using TELeR prompting taxonomy for evaluation
Benchmarking LLMs on Semantic Overlap Summarization
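The TELeR taxonomy organizes prompts by increasing level of detail in the directive. A hypothetical sketch of how leveled SOS prompts might be assembled is below; the wording of each level is illustrative and not taken from the paper's actual templates.

```python
# Hypothetical sketch of TELeR-style leveled prompts for the SOS task.
# Level wordings are illustrative assumptions, not the paper's templates.
def build_sos_prompt(level: int, narrative_a: str, narrative_b: str) -> str:
    directives = {
        0: "",  # Level 0: no directive, only the data
        1: "Summarize the overlapping information in the two narratives.",
        2: ("Summarize the overlapping information in the two narratives. "
            "Include only facts stated in both; exclude facts unique to either."),
        3: ("Summarize the overlapping information in the two narratives. "
            "Include only facts stated in both; exclude facts unique to either. "
            "Answer as one concise paragraph and do not add outside knowledge."),
    }
    if level not in directives:
        raise ValueError("unsupported level in this sketch")
    return (f"{directives[level]}\n\n"
            f"Narrative A: {narrative_a}\n\n"
            f"Narrative B: {narrative_b}").strip()
```

Running the same narrative pair through every level and scoring each output lets a benchmark separate a model's SOS ability from its sensitivity to prompt detail.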
John Salvador
Department of Computer Science and Software Engineering, College of Engineering, Auburn University
Naman Bansal
Doctoral Student, Auburn University
NLP · CV · Deep Learning · Explainable Artificial Intelligence
Mousumi Akter
TU Dortmund & Research Center Trustworthy Data Science and Security
Machine Learning · Natural Language Processing · Data Privacy
Souvik Sarkar
Department of Computer Science and Software Engineering, College of Engineering, Auburn University
Anupam Das
Department of Computer Science and Software Engineering, College of Engineering, Auburn University
Shubhra Kanti Karmaker
Department of Computer Science and Software Engineering, College of Engineering, Auburn University