🤖 AI Summary
Traditional n-gram-based evaluation metrics (e.g., ROUGE) correlate inconsistently with human judgments across typologically diverse languages, underperforming most in morphologically rich languages. Method: a cross-lingual comparative study of eight languages from four typological families (agglutinative, isolating, low-fusional, and high-fusional), using multiple tokenization strategies and large-scale human annotations for abstractive summarization. Contribution/Results: n-gram metrics degrade markedly in fusional languages, a problem only partially mitigated by better tokenization; in contrast, neural metrics trained specifically for evaluation (e.g., COMET) maintain strong correlation with human judgments, including in low-resource languages, and consistently outperform other neural metrics. The work demonstrates the sensitivity of automatic evaluation metrics to linguistic morphology and establishes that evaluation-trained neural metrics offer superior cross-lingual robustness.
📝 Abstract
Automatic n-gram-based metrics such as ROUGE are widely used to evaluate generative tasks such as summarization. While these metrics are considered indicative (if imperfect) proxies for human evaluation in English, their suitability for other languages remains unclear. To address this, we systematically assess both n-gram-based and neural evaluation metrics for generation, measuring their effectiveness across languages and tasks. Specifically, we design a large-scale evaluation suite covering eight languages from four typological families (agglutinative, isolating, low-fusional, and high-fusional), spanning both low- and high-resource settings, and analyze each metric's correlation with human judgments. Our findings highlight the sensitivity of evaluation metrics to language typology. For example, in fusional languages, n-gram-based metrics correlate less well with human assessments than in isolating and agglutinative languages. We also demonstrate that proper tokenization can substantially mitigate this issue for morphologically rich fusional languages, sometimes even reversing negative trends. Additionally, we show that neural metrics trained specifically for evaluation, such as COMET, consistently outperform other neural metrics and correlate better with human judgments in low-resource languages. Overall, our analysis highlights the limitations of n-gram metrics for fusional languages and advocates greater investment in neural metrics trained for evaluation tasks.
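To make the tokenization point concrete, here is a minimal sketch (illustrative only, not the paper's implementation or data): in a fusional or agglutinative language, two summaries whose words share stems but differ in affixes can have zero whole-word unigram overlap, so word-level ROUGE-1 collapses to zero, while a sub-word (character n-gram) view of the same pair recovers the similarity. The example sentences are hypothetical Turkish-like strings chosen to show the effect.

```python
# Illustrative sketch: word-level ROUGE-1 vs. a character n-gram variant.
# Inflected forms differing only by affixes share no whole-word unigrams,
# so word-level overlap is zero; sub-word units recover the similarity.
from collections import Counter


def rouge1_f1(ref_tokens, hyp_tokens):
    """Unigram-overlap F1 between two token lists."""
    ref, hyp = Counter(ref_tokens), Counter(hyp_tokens)
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def char_ngrams(text, n=3):
    """Character n-grams, with spaces marked so word boundaries survive."""
    s = text.replace(" ", "_")
    return [s[i:i + n] for i in range(len(s) - n + 1)]


# Hypothetical pair: same stems, different case/possessive suffixes.
ref = "evlerinde kitaplar vardı"
hyp = "evlerinin kitapları vardır"

word_f1 = rouge1_f1(ref.split(), hyp.split())        # 0.0: no exact word match
char_f1 = rouge1_f1(char_ngrams(ref), char_ngrams(hyp))  # > 0: stems overlap
print(word_f1, round(char_f1, 2))
```

The paper's actual tokenization strategies and metric implementations may differ; the sketch only demonstrates why the choice of tokenization unit matters for n-gram metrics in morphologically rich languages.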