🤖 AI Summary
Existing GEC evaluation metrics suffer from fragmented implementations and inconsistent interfaces, leading to unfair system comparisons, poor reproducibility, and limited extensibility. To address this, we propose gec-metrics, the first modular, unified evaluation framework designed specifically for grammatical error correction. gec-metrics introduces a standardized API and integrates meta-evaluation, statistical analysis, and visualization (via Matplotlib/Seaborn), providing consistent, unified wrappers for mainstream metrics including ERRANT, M2, and GLEU. Implemented in Python and released under the MIT License as a production-ready PyPI package, gec-metrics models GEC evaluation as a plug-and-play, verifiable modular pipeline. This design reduces evaluation bias and improves fairness and reproducibility in system comparisons. Already adopted by multiple GEC research teams, gec-metrics advances methodological standardization in GEC evaluation.
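To make the "standardized API" idea concrete, here is a minimal sketch of what a unified metric interface can look like. All names (`MetricBase`, `ExactMatch`, `score`) are hypothetical illustrations, not the library's actual API; the toy metric stands in for real wrappers such as ERRANT or GLEU.

```python
from abc import ABC, abstractmethod

class MetricBase(ABC):
    """Hypothetical unified interface: every metric scores a system
    from source sentences, hypotheses, and (optional) references."""

    @abstractmethod
    def score(self, sources, hypotheses, references=None):
        """Return a single corpus-level score."""

class ExactMatch(MetricBase):
    """Toy reference-based metric: fraction of hypotheses that exactly
    match any reference (a stand-in for ERRANT / M2 / GLEU wrappers)."""

    def score(self, sources, hypotheses, references=None):
        hits = sum(
            any(hyp == ref for ref in refs)
            for hyp, refs in zip(hypotheses, references)
        )
        return hits / len(hypotheses)

# Because every metric exposes the same score() signature, systems can
# be compared with one evaluation loop regardless of the metric chosen.
metric = ExactMatch()
srcs = ["He go to school .", "I likes cats ."]
hyps = ["He goes to school .", "I likes cats ."]
refs = [["He goes to school ."], ["I like cats ."]]
print(metric.score(srcs, hyps, refs))  # 0.5
```

The design point is that swapping metrics changes one constructor call, not the evaluation loop, which is what enables consistent, reproducible comparisons across systems.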
📝 Abstract
We introduce gec-metrics, a library for using and developing grammatical error correction (GEC) evaluation metrics through a unified interface. Our library enables fair system comparisons by ensuring that everyone evaluates with the same consistent implementation. Moreover, it is designed with a strong focus on API usage, making it highly extensible. It also includes meta-evaluation functionality and provides analysis and visualization scripts, supporting the development of new GEC evaluation metrics. Our code is released under the MIT license and is also distributed as an installable package. The video is available on YouTube.
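The meta-evaluation functionality mentioned above typically means checking how well a metric's system ranking agrees with human judgments. The sketch below illustrates that idea with a pure-Python Spearman rank correlation over hypothetical scores; it is not gec-metrics code, and all data values are invented for illustration.

```python
# Meta-evaluation sketch: rank systems by a metric and by human
# ratings, then measure agreement with Spearman's rho (no-ties case).
def ranks(xs):
    """Rank positions (0 = best) of each value, descending order."""
    order = sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(a, b):
    """Spearman's rho via the squared-rank-difference formula
    (valid when there are no tied scores)."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

metric_scores = [0.62, 0.55, 0.71, 0.48]  # hypothetical metric scores per system
human_scores = [0.60, 0.58, 0.69, 0.40]   # hypothetical human ratings per system
print(spearman(metric_scores, human_scores))  # 1.0: identical rankings here
```

A high rank correlation with human judgments is the usual evidence that a metric is trustworthy for comparing GEC systems.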