🤖 AI Summary
This study re-examines whether large language models (LLMs) truly rival or surpass human-written reference summaries, and whether text summarization remains an open research problem. Using a multi-track evaluation that combines controlled human assessment, bias-mitigated LLM-as-Judge scoring, factuality verification against external knowledge, and corpus-level linguistic analysis, the authors compare summary quality across five datasets and five state-of-the-art LLMs. The findings indicate that human reference summaries retain advantages in informativeness and faithfulness, while LLM outputs are preferred mainly for surface-level coherence and fluency. Human references are also more factually reliable, particularly for claims involving reasoning or synthesis, and model-generated summaries show stylistic homogeneity across models. Taken together, the results suggest that current LLMs have raised the floor of summarization quality but have not yet reached the human performance ceiling.
📝 Abstract
The progress of large language models (LLMs) has fueled claims that model-generated summaries rival or even surpass human-written references, raising questions about whether summarization remains an open research problem. We re-examine this narrative through a multi-track evaluation covering five diverse datasets and five state-of-the-art LLMs, combining controlled human assessment, bias-mitigated LLM-as-Judge protocols, factuality verification against external knowledge, and corpus-level linguistic analysis. Our findings reveal a more nuanced landscape in which human reference summaries continue to demonstrate advantages in informativeness and faithfulness, whereas LLM outputs are preferred mainly for surface-level coherence and fluency. Factuality verification indicates that human references remain more reliable, particularly for claims involving reasoning or synthesis, and linguistic analysis uncovers a pattern of stylistic homogeneity across different models. These observations suggest that current LLMs have raised the floor of summarization quality, but the ceiling of their performance remains below human capabilities.