Corpora deduplication or duplication in Natural Language Processing of few resourced languages ? A case of study: The Mexico's Nahuatl

📅 2026-04-08
🤖 AI Summary
This study addresses the challenge of building effective natural language processing models for Nahuatl (spelled Nawatl in the abstract), a low-resource language with scarce corpora and agglutinative, polysynthetic morphology. The authors expand the limited π-yalli corpus through controlled incremental duplication, a data augmentation technique that, to their knowledge, has not previously been used in the literature. Static word embeddings are then trained on the augmented data and evaluated on a sentence-level semantic similarity task. Embeddings trained on the incrementally duplicated corpus show a moderate improvement over those trained on the original corpus alone. The results suggest that controlled incremental duplication may be a viable strategy for semantic representation learning in extremely low-resource settings.
📝 Abstract
In this article, we seek to answer the following question: could data duplication be useful in Natural Language Processing (NLP) for languages with limited computational resources? For languages of this type (or $π$-languages), corpora available for training Large Language Models are virtually non-existent. In particular, we study the impact of corpus expansion in Nawatl, an agglutinative and polysynthetic $π$-language spoken by over 2 million people, with a large number of dialectal varieties. We expand the new $π$-yalli corpus, which contains a limited number of Nawatl texts, by duplicating it in a controlled way using an incremental duplication technique. The goal is to learn embeddings that are well suited to NLP tasks. To this end, static embeddings were trained and evaluated on a sentence-level semantic similarity task. Our results show a moderate improvement in performance when using incremental duplication compared with the results obtained using only the corpus without expansion. Furthermore, to our knowledge, this technique has not yet been used in the literature.
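As a rough illustration of what controlled duplication of a small corpus could look like, the sketch below builds a series of corpora in which the original sentence list is repeated 1, 2, ..., k times. The abstract does not describe the exact procedure, so the helper name `incremental_duplicates`, the whole-corpus repetition scheme, and the toy sentences are all assumptions made for illustration, not the authors' method.

```python
from typing import Dict, List


def incremental_duplicates(corpus: List[str], max_copies: int) -> Dict[int, List[str]]:
    """Return {k: corpus repeated k times} for k = 1..max_copies.

    Hypothetical helper: the actual duplication scheme is not given in the
    abstract; repeating the whole corpus is an assumption.
    """
    if max_copies < 1:
        raise ValueError("max_copies must be at least 1")
    return {k: corpus * k for k in range(1, max_copies + 1)}


# Toy stand-in for a small corpus (placeholder strings, not real corpus data).
toy_corpus = ["sentence one", "sentence two", "sentence three"]
expanded = incremental_duplicates(toy_corpus, max_copies=3)

for k, sentences in expanded.items():
    # Step k holds k * len(toy_corpus) sentences.
    print(k, len(sentences))
```

Each expanded corpus could then be passed to a static embedding trainer and the resulting embeddings compared on the similarity task, to see at which duplication level, if any, performance changes.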
Problem

Research questions and friction points this paper is trying to address.

corpora deduplication
low-resource languages
Nahuatl
data duplication
Natural Language Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

incremental duplication
low-resource languages
corpus expansion
static embeddings
Nahuatl
Authors
Juan-José Guzman-Landa
LIA, Université d’Avignon, Avignon, France
Juan-Manuel Torres-Moreno
Université d'Avignon / Polytechnique Montréal
Natural Language Processing, Nahuatl, Language Engineering, Summarization
Graham Ranger
LCTT, Université d’Avignon, Avignon, France
Miguel Figueroa-Saavedra
Instituto de Investigaciones en Educación, Universidad Veracruzana, Xalapa, Mexico
Martha-Lorena Avendaño-Garrido
Facultad de Matemáticas, Universidad Veracruzana, Xalapa, Mexico
Elvys Linhares-Pontes
Trading Central, France
Luis-Gil Moreno-Jiménez
PhD Student
Sentiment Analysis, Linguistic Corpora, Automatic Text Generation, Named Entity Recognition, Classification