Two CFG Nahuatl for automatic corpora expansion

📅 2025-12-16
🤖 AI Summary
Problem: Nahuatl, a national language of Mexico, has very few digital resources, and corpora suitable for training LLMs are virtually non-existent. Method: We propose two context-free grammars (CFGs), used in generative mode, to automatically produce a large number of syntactically valid artificial sentences and thereby expand the corpus available for learning non-contextual embeddings. Results: Embeddings trained on the expanded corpus improve on a sentence semantic similarity task compared with those trained on the original corpus alone, and these lightweight non-contextual embeddings often perform better than some LLMs. The approach targets π-languages, i.e. languages with few digital resources.

📝 Abstract
The aim of this article is to introduce two Context-Free Grammars (CFGs) for Nawatl corpora expansion. Nawatl is an Amerindian language (a National Language of Mexico) of the $π$-language type, i.e. a language with few digital resources. As a result, corpora suitable for training Large Language Models (LLMs) are virtually non-existent, which poses a significant challenge. The goal is to produce a substantial number of syntactically valid artificial Nawatl sentences and thereby expand the corpora for the purpose of learning non-contextual embeddings. To this end, we introduce two new Nawatl CFGs and use them in generative mode. With these grammars, the Nawatl corpus can be expanded significantly and then used to learn embeddings, whose relevance is evaluated on a sentence semantic similarity task. The results show an improvement over those obtained using only the original corpus without artificial expansion, and also show that economical embeddings often perform better than some LLMs.
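To make "using a CFG in generative mode" concrete, here is a minimal Python sketch. It is not the authors' grammar or code: the grammar `TOY_GRAMMAR`, its placeholder terminals (`noun_1`, `verb_1`, and so on), and the helper names `generate` and `expand_corpus` are all hypothetical, chosen only to show how random top-down expansion of production rules yields distinct, syntactically valid sentences.

```python
import random

# Hypothetical toy grammar (an assumption for illustration only).
# Nonterminals map to lists of productions; terminals are placeholder
# tokens, not real Nawatl vocabulary.
TOY_GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["DET", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "DET": [["det_1"], ["det_2"]],
    "N": [["noun_1"], ["noun_2"], ["noun_3"]],
    "V": [["verb_1"], ["verb_2"]],
}


def generate(grammar, symbol="S", rng=random):
    """Expand `symbol` top-down, picking one production at random."""
    if symbol not in grammar:  # terminal symbol: emit it unchanged
        return [symbol]
    production = rng.choice(grammar[symbol])
    tokens = []
    for child in production:
        tokens.extend(generate(grammar, child, rng))
    return tokens


def expand_corpus(grammar, n_sentences, seed=0):
    """Collect up to `n_sentences` distinct generated sentences."""
    rng = random.Random(seed)
    sentences = set()
    attempts = 0
    # Cap the attempts so a small grammar cannot cause an endless loop.
    while len(sentences) < n_sentences and attempts < 100 * n_sentences:
        sentences.add(" ".join(generate(grammar, "S", rng)))
        attempts += 1
    return sorted(sentences)


if __name__ == "__main__":
    for sentence in expand_corpus(TOY_GRAMMAR, 5):
        print(sentence)
```

This toy grammar is non-recursive, so every expansion terminates. A grammar with recursive rules would need a depth limit or weighted productions, which this sketch leaves out.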
Problem

Research questions and friction points this paper is trying to address.

Develop CFGs to generate synthetic Nawatl sentences for corpus expansion
Address lack of digital resources for training language models in Nawatl
Improve embeddings, evaluated on a sentence semantic similarity task, by training on artificial data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two new Context-Free Grammars for Nawatl
Generate artificial sentences to expand corpora
Use expanded corpus to learn non-contextual embeddings
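As a rough illustration of how non-contextual embeddings can be scored on a sentence similarity task, the sketch below averages static word vectors into a sentence vector and compares two sentences by cosine similarity. This is an assumption about one common baseline, not the evaluation protocol reported in the paper. The vectors in `TOY_VECTORS` and the function names are hypothetical.

```python
import math

# Hypothetical 3-dimensional static (non-contextual) word vectors.
# The values are made up for illustration, not learned from any corpus.
TOY_VECTORS = {
    "noun_1": [1.0, 0.0, 0.0],
    "noun_2": [0.9, 0.1, 0.0],
    "verb_1": [0.0, 1.0, 0.0],
    "verb_2": [0.0, 0.0, 1.0],
}


def sentence_vector(sentence, vectors):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in sentence.split() if t in vectors]
    if not known:
        raise ValueError("no known tokens in sentence")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(s1, s2, vectors=TOY_VECTORS):
    """Similarity of two sentences via averaged static word vectors."""
    return cosine(sentence_vector(s1, vectors), sentence_vector(s2, vectors))


if __name__ == "__main__":
    print(similarity("noun_1 verb_1", "noun_2 verb_1"))
    print(similarity("noun_1 verb_1", "noun_2 verb_2"))
```

In practice the word vectors would be learned from the original or expanded corpus, and the predicted similarities would be compared against reference judgments.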
Juan-José Guzmán-Landa
Laboratoire Informatique d’Avignon, Avignon Université, France
Juan-Manuel Torres-Moreno
Université d'Avignon / Polytechnique Montréal
Traitement Automatique des Langues (Natural Language Processing), Nahuatl, Language Engineering, Summarization
Miguel Figueroa-Saavedra
Instituto de Investigaciones en Educación, Universidad Veracruzana, Xalapa, Mexico
Ligia Quintana-Torres
Facultad de Matemáticas, Universidad Veracruzana, Xalapa, Mexico
Graham Ranger
Laboratoire Identités Culturelles, Textes et Théâtralité, Avignon Université, France
Martha-Lorena Avendaño-Garrido
Facultad de Matemáticas, Universidad Veracruzana, Xalapa, Mexico