PreSumm: Predicting Summarization Performance Without Summarizing

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the root causes of performance variability in automatic summarization models and introduces PreSumm: a novel task that predicts summary quality directly from the source document, without generating any summary. Methodologically, the authors formalize summary difficulty prediction as a document-level regression or classification problem, constructing a lightweight predictor that integrates linguistic features, structural metrics, and pretrained contextual representations. Key contributions include: (i) identifying document coherence, topic clarity, and content complexity as robust, model-agnostic determinants of summarization difficulty; (ii) achieving high predictive correlation (ρ > 0.7) across diverse summarization models on multi-model benchmark datasets; and (iii) enabling reliable identification of hard-to-summarize documents that require human intervention, thereby improving efficiency in hybrid summarization pipelines and dataset curation. This framework establishes a new paradigm for diagnosing, optimizing, and evaluating summarization systems in a controllable way.
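The core idea, predicting a quality score from document features alone, can be sketched as a tiny regression pipeline. The features below (average sentence length, type-token ratio, long-word ratio) and the closed-form ridge solver are illustrative assumptions standing in for the paper's linguistic, structural, and contextual features; they are not the authors' actual implementation.

```python
# Minimal sketch of the PreSumm setup: score a document's expected
# summarization quality without generating a summary. Features and
# solver are illustrative placeholders, not the paper's method.
import re

def features(doc: str) -> list[float]:
    """Crude document-level proxies for complexity and lexical diversity."""
    sents = [s for s in re.split(r"[.!?]+\s*", doc) if s]
    words = re.findall(r"\w+", doc.lower())
    avg_sent_len = len(words) / max(len(sents), 1)      # content-complexity proxy
    ttr = len(set(words)) / max(len(words), 1)          # type-token ratio
    long_ratio = sum(len(w) > 6 for w in words) / max(len(words), 1)
    return [1.0, avg_sent_len, ttr, long_ratio]         # leading 1.0 = bias term

def ridge_fit(X, y, lam=1e-6):
    """Solve (X^T X + lam*I) w = X^T y by Gaussian elimination (tiny n)."""
    n = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(x[i] * t for x, t in zip(X, y)) for i in range(n)]
    for col in range(n):                                # forward elimination
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n                                       # back substitution
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w

def predict(w, doc: str) -> float:
    """Predicted summarization-quality score for an unseen document."""
    return sum(wi * fi for wi, fi in zip(w, features(doc)))
```

Training would regress these features against observed summary-quality targets (e.g., ROUGE scores averaged over several summarizers, as the multi-model setup in the paper suggests), after which `predict` scores new documents with no summarizer in the loop.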

📝 Abstract
Despite recent advancements in automatic summarization, state-of-the-art models do not summarize all documents equally well, raising the question: why? While prior research has extensively analyzed summarization models, little attention has been given to the role of document characteristics in influencing summarization performance. In this work, we explore two key research questions. First, do documents exhibit consistent summarization quality across multiple systems? If so, can we predict a document's summarization performance without generating a summary? We answer both questions affirmatively and introduce PreSumm, a novel task in which a system predicts summarization performance based solely on the source document. Our analysis sheds light on common properties of documents with low PreSumm scores, revealing that they often suffer from coherence issues, complex content, or a lack of a clear main theme. In addition, we demonstrate PreSumm's practical utility in two key applications: improving hybrid summarization workflows by identifying documents that require manual summarization and enhancing dataset quality by filtering outliers and noisy documents. Overall, our findings highlight the critical role of document properties in summarization performance and offer insights into the limitations of current systems that could serve as the basis for future improvements.

Problem

Research questions and friction points this paper is trying to address.

Predict summarization performance without generating summaries
Analyze document characteristics affecting summarization quality
Improve hybrid workflows and dataset quality using predictions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Formalizes PreSumm as a document-level prediction task requiring no summary generation
Identifies coherence, topic clarity, and content complexity as model-agnostic difficulty signals
Demonstrates practical gains in hybrid pipelines and dataset filtering