🤖 AI Summary
This work addresses the grammatical error correction (GEC) challenge for Zarma, a low-resource West African language hampered by nonstandard orthography, scarce annotated data, and dialectal variation. We conduct the first systematic evaluation of rule-based systems, machine translation (MT) models, and small-scale multilingual LLMs for Zarma GEC. Our proposed multi-strategy framework integrates a rule engine, M2M100 (in zero-shot and fine-tuned settings), and mT5-small, trained on a novel benchmark comprising over 250,000 synthetically augmented and human-annotated samples, the first publicly available Zarma GEC dataset. Experimental results show that fine-tuned M2M100 achieves a 95.82% error detection rate and a 78.90% suggestion accuracy (human evaluation: 3.0/5.0), significantly outperforming both the rule-based and LLM baselines; its successful cross-lingual transfer to Bambara further demonstrates generalizability. Key contributions include: (1) the first end-to-end Zarma GEC system; (2) the first methodology comparison for GEC in a low-resource language setting; (3) a reproducible synthetic data construction pipeline; and (4) empirical validation of cross-lingual transfer for African language GEC.
📝 Abstract
Grammatical error correction (GEC) aims to improve the quality and readability of texts through accurate correction of linguistic mistakes. Previous work has focused on high-resource languages, while low-resource languages lack robust tools. These languages often face problems such as non-standard orthography, limited annotated corpora, and diverse dialects, which slow down the development of GEC tools. We present a study on GEC for Zarma, a language spoken by over five million people in West Africa. We compare three approaches: rule-based methods, machine translation (MT) models, and large language models (LLMs). We evaluate them using a dataset of more than 250,000 examples, including synthetic and human-annotated data. Our results show that the MT-based approach using M2M100 outperforms the others, with a detection rate of 95.82% and a suggestion accuracy of 78.90% in automatic evaluations (AE), and an average score of 3.0 out of 5.0 in manual evaluation (ME) by native speakers for grammar and logical corrections. The rule-based method was effective for spelling errors but failed on complex context-level errors. The LLM approach -- mT5-small -- showed moderate performance. Our work supports the use of MT models to enhance GEC in low-resource settings, and we validated these results on Bambara, another West African language.
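The abstract mentions a dataset built partly from synthetic data. A common way to construct such GEC training pairs is to inject artificial noise into clean monolingual text, yielding (corrupted, clean) pairs for a sequence-to-sequence model. The sketch below is a minimal, hypothetical illustration of character-level error injection; the function name `inject_errors`, the specific noise operations, and the example sentence are assumptions for illustration, not the paper's actual pipeline:

```python
import random

def inject_errors(sentence: str, p: float = 0.15, seed=None) -> str:
    """Corrupt a clean sentence with simple character-level noise
    (drop, duplicate, or swap letters) to create a synthetic GEC pair.
    Illustrative only; a real pipeline would also model word-level
    and orthographic error patterns observed in the target language."""
    rng = random.Random(seed)
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        # Only perturb letters, with probability p per character.
        if c.isalpha() and rng.random() < p:
            op = rng.choice(["drop", "dup", "swap"])
            if op == "drop":
                i += 1          # skip the character entirely
                continue
            if op == "dup":
                out.append(c + c)
            elif op == "swap" and i + 1 < len(chars):
                out.append(chars[i + 1] + c)  # transpose adjacent chars
                i += 2
                continue
            else:
                out.append(c)   # swap at end of string: no-op
        else:
            out.append(c)
        i += 1
    return "".join(out)

clean = "ay ga koy habu"        # hypothetical clean Zarma sentence
noisy = inject_errors(clean, seed=1)
pair = (noisy, clean)           # (source, target) training example
```

Fixing the random seed makes the corruption reproducible, which matters for the kind of reproducible data-construction pipeline the summary describes.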