A First Context-Free Grammar Applied to Nawatl Corpora Augmentation

📅 2025-10-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Low-resource polysynthetic indigenous languages like Nahuatl suffer from severe scarcity of annotated training data, hindering effective language model development. Method: We propose a grammar-driven data augmentation framework: (1) we design and implement the first context-free grammar (CFG) for Nahuatl; (2) we systematically generate syntactically well-formed artificial sentences using CFG rules, enriched with FastText word embeddings for lexical realism; and (3) we integrate human validation to ensure grammatical correctness and semantic controllability. The augmented corpus is named π-yalli. Contribution/Results: Evaluated on sentence-level semantic tasks—including semantic similarity and textual entailment—models trained on π-yalli achieve substantial performance gains over baselines trained on raw scarce data; notably, compact models outperform general-purpose large language models on several tasks. This work establishes a reproducible, scalable methodology for grammar-informed corpus construction for low-resource indigenous languages, advancing both model training and evaluation in under-resourced linguistic settings.
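The grammar-driven generation in steps (1)–(2) can be sketched with a toy CFG expanded by random rule choice. The rules and the small lexicon below are illustrative stand-ins, not the paper's actual Nawatl grammar; a realistic grammar for a polysynthetic language would also model verb morphology.

```python
import random

# Toy context-free grammar sketch (hypothetical rules and lexicon --
# NOT the paper's actual Nawatl CFG). Each rule maps a nonterminal
# to a list of alternative right-hand sides.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["N"], ["DET", "N"]],
    "VP":  [["V"], ["V", "NP"]],
    "DET": [["in"]],                     # determiner 'the'
    "N":   [["siwatl"], ["kalli"]],      # 'woman', 'house'
    "V":   [["kochi"], ["kita"]],        # 'sleeps', 'sees'
}

def generate(symbol="S", rng=random):
    """Expand a nonterminal into a list of terminal tokens."""
    if symbol not in GRAMMAR:            # terminal: emit as-is
        return [symbol]
    rule = rng.choice(GRAMMAR[symbol])   # pick one production at random
    return [tok for part in rule for tok in generate(part, rng)]

print(" ".join(generate()))              # one syntactically well-formed sentence
```

Sampling this generator many times yields the pool of artificial sentences that, after human validation, augments the training corpus.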

📝 Abstract
In this article we introduce a context-free grammar (CFG) for the Nawatl language. Nawatl (or Nahuatl) is an Amerindian language of the π-language type, i.e. a language with few digital resources, for which the corpora available for machine learning are virtually non-existent. The objective here is to generate a significant number of grammatically correct artificial sentences, in order to enlarge the corpora available for language model training. We want to show that a grammar enables us to significantly expand a corpus in Nawatl which we call π-YALLI. The corpus, thus enriched, enables us to train algorithms such as FastText and to evaluate them on sentence-level semantic tasks. Preliminary results show that by using the grammar, comparative improvements are achieved over some LLMs. However, it is observed that to achieve more significant improvement, grammars that model the Nawatl language even more effectively are required.
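The sentence-level evaluation described in the abstract can be illustrated by averaging word embeddings into a sentence vector and comparing sentences with cosine similarity. The three-dimensional vectors below are hypothetical stand-ins for trained FastText embeddings, kept tiny for readability.

```python
import math

# Hypothetical word vectors standing in for trained FastText embeddings.
EMB = {
    "siwatl":  [0.9, 0.1, 0.0],   # 'woman'
    "tlakatl": [0.8, 0.2, 0.1],   # 'man' -- semantically close to 'woman'
    "kalli":   [0.0, 0.1, 0.9],   # 'house' -- semantically distant
}

def sent_vec(tokens):
    """Average the embeddings of known tokens into one sentence vector."""
    dims = len(next(iter(EMB.values())))
    acc, n = [0.0] * dims, 0
    for t in tokens:
        if t in EMB:
            for i, x in enumerate(EMB[t]):
                acc[i] += x
            n += 1
    return [x / n for x in acc] if n else acc

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine(sent_vec(["siwatl"]), sent_vec(["tlakatl"])))  # near-synonyms: high
print(cosine(sent_vec(["siwatl"]), sent_vec(["kalli"])))    # unrelated: low
```

In the paper's setting the vectors would come from FastText trained on the augmented π-YALLI corpus; the averaging-plus-cosine scheme is a common, simple baseline for sentence-level semantic similarity.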
Problem

Research questions and friction points this paper is trying to address.

Generating artificial sentences to augment scarce Nawatl language corpora
Creating context-free grammar for under-resourced Amerindian language Nawatl
Improving language model training for low-resource Nawatl through corpus expansion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-free grammar generates artificial Nawatl sentences
Augmented corpus trains FastText for semantic tasks
Grammar-based approach outperforms some LLMs
Juan-José Guzmán-Landa
Laboratoire Informatique d’Avignon, Avignon Université, BP 84911 Agroparc Cedex 9, Avignon, France
Juan-Manuel Torres-Moreno
Université d'Avignon / Polytechnique Montréal
Natural Language Processing (Traitement Automatique des Langues) · Nahuatl · Language Engineering · Summarization
Miguel Figueroa-Saavedra
Instituto de Investigaciones en Educación, Universidad Veracruzana, Xalapa, Mexico
Ligia Quintana-Torres
Facultad de Matemáticas, Universidad Veracruzana, Xalapa, Mexico
Martha-Lorena Avendaño-Garrido
Facultad de Matemáticas, Universidad Veracruzana, Xalapa, Mexico
Graham Ranger
Laboratoire Identités Culturelles, Textes et Théâtralité, Avignon Université, 74 Rue Louis Pasteur, 84029 Avignon, France