Rethinking Evaluation Metrics for Grammatical Error Correction: Why Use a Different Evaluation Process than Human?

📅 2025-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current automatic evaluation of Grammatical Error Correction (GEC) predominantly relies on absolute scoring followed by averaging, whereas human evaluation is inherently sentence-level and comparative—typically involving pairwise judgments—leading to systematic misalignment between automatic metrics and human preferences. This work is the first to systematically identify and analyze this fundamental discrepancy. We propose a ranking-based meta-evaluation framework that unifies diverse metric outputs—including edit distance, n-gram overlap, semantic similarity, and BERT/GPT-derived features—into pairwise preference relations. These relations are then aggregated consistently using probabilistic ranking models (e.g., Bradley–Terry). Evaluated on the SEEDA benchmark, our approach significantly improves correlation with human judgments across all metrics; notably, several BERT-based variants surpass GPT-4 in alignment with human preferences. We publicly release both the unified evaluation implementation and a reproducible meta-evaluation protocol.

📝 Abstract
One of the goals of automatic evaluation metrics in grammatical error correction (GEC) is to rank GEC systems such that the ranking matches human preferences. However, current automatic evaluations are based on procedures that diverge from human evaluation. Specifically, human evaluation derives rankings by aggregating sentence-level relative evaluation results, e.g., pairwise comparisons, using a rating algorithm, whereas automatic evaluation averages sentence-level absolute scores to obtain corpus-level scores, which are then sorted to determine rankings. In this study, we propose an aggregation method for existing automatic evaluation metrics that aligns with human evaluation methods to bridge this gap. We conducted experiments using various metrics, including edit-based metrics, $n$-gram based metrics, and sentence-level metrics, and show that resolving the gap improves results for most of the metrics on the SEEDA benchmark. We also found that even BERT-based metrics sometimes outperform GPT-4-based metrics. We publish our unified implementation of the metrics and meta-evaluations.
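The aggregation procedure the abstract describes can be sketched as follows: convert per-sentence metric scores into pairwise win counts between systems, then fit a Bradley–Terry model (a standard rating algorithm, mentioned in the summary) to obtain a corpus-level ranking. This is an illustrative sketch, not the paper's released implementation; the toy scores and function names are hypothetical.

```python
import itertools

def bradley_terry(num_systems, wins, iters=200):
    """Estimate Bradley-Terry strengths via the MM algorithm.
    wins[i][j] = number of sentences where system i was preferred over j."""
    p = [1.0] * num_systems
    for _ in range(iters):
        new_p = []
        for i in range(num_systems):
            # total wins of system i (numerator of the MM update)
            num = sum(wins[i][j] for j in range(num_systems) if j != i)
            # comparisons involving i, weighted by current strengths
            den = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                      for j in range(num_systems) if j != i)
            new_p.append(num / den if den > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]  # normalize so strengths sum to 1
    return p

# Hypothetical per-sentence metric scores: scores[s][k] is the score
# any existing metric assigns to system s's correction of sentence k.
scores = [
    [0.9, 0.80, 0.7],
    [0.6, 0.85, 0.9],
    [0.5, 0.40, 0.6],
]

# Sentence-level relative evaluation: count pairwise wins per sentence.
n = len(scores)
wins = [[0] * n for _ in range(n)]
for i, j in itertools.permutations(range(n), 2):
    wins[i][j] = sum(a > b for a, b in zip(scores[i], scores[j]))

strengths = bradley_terry(n, wins)
ranking = sorted(range(n), key=lambda s: -strengths[s])
print(ranking)  # system indices, best first
```

Note the contrast with the standard procedure: averaging each row of `scores` would rank system 0 first, whereas the pairwise view rewards system 1 for winning on more individual sentences, mirroring how human pairwise judgments are aggregated.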
Problem

Research questions and friction points this paper is trying to address.

Align automatic GEC evaluations with human methods
Improve ranking accuracy of GEC systems
Propose new aggregation method for metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Alignment with human evaluation methods
Aggregation of sentence-level relative results
Experimentation with diverse GEC metrics