QAPyramid: Fine-grained Evaluation of Content Selection for Text Summarization

📅 2024-12-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Manual evaluation of summary content coverage lacks systematicity and fine-grained analysis. Method: This paper proposes QAPyramid, the first framework to integrate QA-SRL (Question-Answer driven Semantic Role Labeling) into Pyramid-style evaluation. It systematically decomposes each reference summary into reproducible, atomic question-answer (QA) units for content coverage assessment, eliminates the need for expert annotation while maintaining high inter-annotator agreement, and introduces an automated scoring metric that combines BERTScore and exact match. Contribution/Results: On CNN/DailyMail, the authors collected 8.9K QA-level annotations to evaluate 10 summarization systems. QAPyramid markedly improves the systematicity and discriminative power of evaluation: its automated metric achieves a correlation of ≥0.82 with human judgments, surpassing ROUGE, BERTScore, and other state-of-the-art metrics.

📝 Abstract
How to properly conduct human evaluations for text summarization is a longstanding challenge. The Pyramid human evaluation protocol, which assesses content selection by breaking the reference summary into sub-units and verifying their presence in the system summary, has been widely adopted. However, it suffers from a lack of systematicity in the definition and granularity of the sub-units. We address these problems by proposing QAPyramid, which decomposes each reference summary into finer-grained question-answer (QA) pairs according to the QA-SRL framework. We collect QA-SRL annotations for reference summaries from CNN/DM and evaluate 10 summarization systems, resulting in 8.9K QA-level annotations. We show that, compared to Pyramid, QAPyramid provides more systematic and fine-grained content selection evaluation while maintaining high inter-annotator agreement without needing expert annotations. Furthermore, we propose metrics that automate the evaluation pipeline and achieve higher correlations with QAPyramid than other widely adopted metrics, allowing future work to accurately and efficiently benchmark summarization systems.
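As a rough illustration of the content-selection idea described above (not the paper's actual implementation, which relies on human annotation and learned automatic metrics), the sketch below decomposes a reference summary into QA-SRL-style pairs and scores a system summary by the fraction of pairs it covers. The `qa_present` check here is a naive token-overlap stand-in for the paper's presence judgments; all function names and the 0.8 threshold are illustrative assumptions.

```python
import string

def _tokens(text: str) -> set[str]:
    # Lowercase and strip punctuation before splitting into tokens.
    table = str.maketrans("", "", string.punctuation)
    return set(text.lower().translate(table).split())

def qa_present(question: str, answer: str, system_summary: str) -> bool:
    """Naive presence check: are most answer tokens in the system summary?
    (The paper automates this step with learned metrics; this is a stand-in.)"""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return False
    overlap = len(answer_tokens & _tokens(system_summary)) / len(answer_tokens)
    return overlap >= 0.8  # threshold is an illustrative choice

def qapyramid_score(qa_pairs: list[tuple[str, str]], system_summary: str) -> float:
    """Content-selection score: fraction of reference QA pairs covered."""
    if not qa_pairs:
        return 0.0
    covered = sum(qa_present(q, a, system_summary) for q, a in qa_pairs)
    return covered / len(qa_pairs)

# Toy QA-SRL-style pairs for the reference "Alice sold the car on Monday."
pairs = [
    ("Who sold something?", "Alice"),
    ("What did someone sell?", "the car"),
    ("When did someone sell something?", "on Monday"),
]
print(qapyramid_score(pairs, "Alice sold the car."))  # 2 of 3 pairs covered
```

Because each QA pair is atomic, the score pinpoints exactly which pieces of reference content a system summary misses, which is the fine-grained diagnostic the Pyramid protocol's coarser sub-units cannot provide.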
Problem

Research questions and friction points this paper is trying to address.

Addresses inconsistent granularity in summarization content evaluation methods
Proposes QA-based framework for systematic content selection assessment
Automates evaluation metrics to correlate with human judgment standards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses QA pairs for fine-grained content evaluation
Automates evaluation with high correlation metrics
Maintains inter-annotator agreement without expert input