Introducing OmniGEC: A Silver Multilingual Dataset for Grammatical Error Correction

πŸ“… 2025-09-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing grammatical error correction (GEC) datasets leave most languages under-covered, hindering the development of non-English GEC models. To address this, the authors introduce OmniGEC, a large-scale collection of silver-standard multilingual GEC datasets covering eleven languages. It draws on Wikipedia edit histories (human-made corrections), Reddit user posts, and the Ukrainian-only UberText 2.0 social media corpus; the Reddit and UberText 2.0 texts are corrected automatically with the GPT-4o-mini model. Unlike conventional sentence-level annotations, OmniGEC supports paragraph-level error correction. Fine-tuned on this data, Aya-Expanse (8B) and Gemma-3 (12B) achieve state-of-the-art results for paragraph-level multilingual GEC. The datasets and best-performing models are publicly released on Hugging Face, easing the scarcity of non-English GEC resources.

πŸ“ Abstract
In this paper, we introduce OmniGEC, a collection of multilingual silver-standard datasets for the task of Grammatical Error Correction (GEC), covering eleven languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Slovene, Swedish, and Ukrainian. These datasets facilitate the development of multilingual GEC solutions and help bridge the data gap in adapting English GEC solutions to multilingual GEC. The texts in the datasets originate from three sources: Wikipedia edits for the eleven target languages, subreddits from Reddit in the eleven target languages, and the Ukrainian-only UberText 2.0 social media corpus. While Wikipedia edits were derived from human-made corrections, the Reddit and UberText 2.0 data were automatically corrected with the GPT-4o-mini model. The quality of the corrections in the datasets was evaluated both automatically and manually. Finally, we fine-tune two open-source large language models - Aya-Expanse (8B) and Gemma-3 (12B) - on the multilingual OmniGEC corpora and achieve state-of-the-art (SOTA) results for paragraph-level multilingual GEC. The dataset collection and the best-performing models are available on Hugging Face.
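The paper does not ship code with this summary, but the automatic correction step described in the abstract (GPT-4o-mini correcting Reddit and UberText 2.0 paragraphs) can be sketched roughly as below. The prompt wording and the `build_correction_messages` helper are illustrative assumptions, not the authors' actual setup:

```python
# Sketch of the paragraph-level automatic correction described in the
# abstract: Reddit and UberText 2.0 texts are corrected with GPT-4o-mini.
# Prompt text and function name are assumptions for illustration only.

def build_correction_messages(language: str, paragraph: str) -> list[dict]:
    """Build a chat prompt asking the model to correct a whole paragraph
    (OmniGEC targets paragraph-level, not sentence-level, GEC)."""
    system = (
        f"You are a grammatical error correction system for {language}. "
        "Return only the corrected paragraph, preserving its meaning."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": paragraph},
    ]

# With the OpenAI SDK, the call would then look roughly like:
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=build_correction_messages("Ukrainian", text),
# )
# corrected = resp.choices[0].message.content
```

Keeping the paragraph intact in a single user turn, rather than splitting it into sentences, matches the paper's paragraph-level framing of the task.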
Problem

Research questions and friction points this paper is trying to address.

Creating multilingual silver-standard datasets for grammatical error correction
Bridging the data gap in adapting English GEC solutions to other languages
Developing multilingual GEC solutions from diverse text sources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual silver-standard datasets from Wikipedia, Reddit, and UberText 2.0
Automated corrections generated with the GPT-4o-mini model
Fine-tuned Aya-Expanse (8B) and Gemma-3 (12B) models achieve SOTA results
Roman Kovalchuk
Ukrainian Catholic University, Softserve
Mariana Romanyshyn
Computational linguist, Grammarly
Natural language processing, computational linguistics, Ukrainian NLP
Petro Ivaniuk
Softserve