A First Context-Free Grammar Applied to Nawatl Corpora Augmentation

📅 2025-10-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Low-resource polysynthetic indigenous languages like Nahuatl suffer from severe scarcity of annotated training data, hindering effective language model development. Method: We propose a grammar-driven data augmentation framework: (1) we design and implement the first context-free grammar (CFG) for Nahuatl; (2) we systematically generate syntactically well-formed artificial sentences using CFG rules, enriched with FastText word embeddings for lexical realism; and (3) we integrate human validation to ensure grammatical correctness and semantic controllability. The augmented corpus is named π-yalli. Contribution/Results: Evaluated on sentence-level semantic tasks—including semantic similarity and textual entailment—models trained on π-yalli achieve substantial performance gains over baselines trained on raw scarce data; notably, compact models outperform general-purpose large language models on several tasks. This work establishes a reproducible, scalable methodology for grammar-informed corpus construction for low-resource indigenous languages, advancing both model training and evaluation in under-resourced linguistic settings.
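The grammar-driven generation in steps (1)–(2) can be sketched with a toy CFG expanded by random rule choice. The rules and the small lexicon below are illustrative stand-ins, not the paper's actual Nawatl grammar; a realistic grammar for a polysynthetic language would also model verb morphology.

```python
import random

# Toy context-free grammar sketch (hypothetical rules and lexicon --
# NOT the paper's actual Nawatl CFG). Each rule maps a nonterminal
# to a list of alternative right-hand sides.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["N"], ["DET", "N"]],
    "VP":  [["V"], ["V", "NP"]],
    "DET": [["in"]],                     # determiner 'the'
    "N":   [["siwatl"], ["kalli"]],      # 'woman', 'house'
    "V":   [["kochi"], ["kita"]],        # 'sleeps', 'sees'
}

def generate(symbol="S", rng=random):
    """Expand a nonterminal into a list of terminal tokens."""
    if symbol not in GRAMMAR:            # terminal: emit as-is
        return [symbol]
    rule = rng.choice(GRAMMAR[symbol])   # pick one production at random
    return [tok for part in rule for tok in generate(part, rng)]

print(" ".join(generate()))              # one syntactically well-formed sentence
```

Sampling this generator many times yields the pool of artificial sentences that, after human validation, augments the training corpus.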

📝 Abstract
In this article we introduce a context-free grammar (CFG) for the Nawatl language. Nawatl (or Nahuatl) is an Amerindian language of the π-language type, i.e. a language with few digital resources, for which the corpora available for machine learning are virtually non-existent. The objective here is to generate a significant number of grammatically correct artificial sentences, in order to enlarge the corpora available for language model training. We want to show that a grammar enables us to significantly expand a corpus in Nawatl which we call π-YALLI. The corpus, thus enriched, enables us to train algorithms such as FastText and to evaluate them on sentence-level semantic tasks. Preliminary results show that by using the grammar, comparative improvements are achieved over some LLMs. However, it is observed that to achieve more significant improvement, grammars that model the Nawatl language even more effectively are required.
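The sentence-level evaluation described in the abstract can be illustrated by averaging word embeddings into a sentence vector and comparing sentences with cosine similarity. The three-dimensional vectors below are hypothetical stand-ins for trained FastText embeddings, kept tiny for readability.

```python
import math

# Hypothetical word vectors standing in for trained FastText embeddings.
EMB = {
    "siwatl":  [0.9, 0.1, 0.0],   # 'woman'
    "tlakatl": [0.8, 0.2, 0.1],   # 'man' -- semantically close to 'woman'
    "kalli":   [0.0, 0.1, 0.9],   # 'house' -- semantically distant
}

def sent_vec(tokens):
    """Average the embeddings of known tokens into one sentence vector."""
    dims = len(next(iter(EMB.values())))
    acc, n = [0.0] * dims, 0
    for t in tokens:
        if t in EMB:
            for i, x in enumerate(EMB[t]):
                acc[i] += x
            n += 1
    return [x / n for x in acc] if n else acc

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine(sent_vec(["siwatl"]), sent_vec(["tlakatl"])))  # near-synonyms: high
print(cosine(sent_vec(["siwatl"]), sent_vec(["kalli"])))    # unrelated: low
```

In the paper's setting the vectors would come from FastText trained on the augmented π-YALLI corpus; the averaging-plus-cosine scheme is a common, simple baseline for sentence-level semantic similarity.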
Problem

Research questions and friction points this paper is trying to address.

Generating artificial sentences to augment scarce Nawatl language corpora
Creating context-free grammar for under-resourced Amerindian language Nawatl
Improving language model training for low-resource Nawatl through corpus expansion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-free grammar generates artificial Nawatl sentences
Augmented corpus trains FastText for semantic tasks
Grammar-based approach outperforms some LLMs
Juan-José Guzmán-Landa
Laboratoire Informatique d’Avignon, Avignon Université, BP 84911 Agroparc Cedex 9, Avignon, France
Juan-Manuel Torres-Moreno
Université d'Avignon / Polytechnique Montréal
Natural Language Processing (Traitement Automatique des Langues) · Nahuatl · Language Engineering · Summarization
Miguel Figueroa-Saavedra
Instituto de Investigaciones en Educación, Universidad Veracruzana, Xalapa, Mexico
Ligia Quintana-Torres
Facultad de Matemáticas, Universidad Veracruzana, Xalapa, Mexico
Martha-Lorena Avendaño-Garrido
Facultad de Matemáticas, Universidad Veracruzana, Xalapa, Mexico
Graham Ranger
Laboratoire Identités Culturelles, Textes et Théâtralité, Avignon Université, 74 Rue Louis Pasteur, 84029 Avignon, France