EuskañolDS: A Naturally Sourced Corpus for Basque-Spanish Code-Switching

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Research on Basque-Spanish code-switching (CS) has long been held back by the scarcity of naturally occurring, high-quality corpora. This work addresses that gap by constructing EuskañolDS, a naturally sourced CS corpus for this language pair, drawn from previously available corpora that reflect bilingual interactions in the north of the Iberian Peninsula. The methodology has two steps: language identification models first flag candidate CS texts, and manual validation then yields a reliable subset of CS instances. CS in this contact setting occurs in both formal and informal spontaneous interactions. EuskañolDS is intended to support the development and evaluation of models that understand and generate code-switched Basque-Spanish text, an area where resources are almost non-existent.

📝 Abstract
Code-switching (CS) remains a significant challenge in Natural Language Processing (NLP), mainly due to a lack of relevant data. In the context of the contact between the Basque and Spanish languages in the north of the Iberian Peninsula, CS frequently occurs in both formal and informal spontaneous interactions. However, resources to analyse this phenomenon and support the development and evaluation of models capable of understanding and generating code-switched language for this language pair are almost non-existent. We introduce a first approach to develop a naturally sourced corpus for Basque-Spanish code-switching. Our methodology consists of identifying CS texts from previously available corpora using language identification models, which are then manually validated to obtain a reliable subset of CS instances. We present the properties of our corpus and make it available under the name EuskañolDS.
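The filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the tiny word lists below are a hypothetical stand-in for a trained language identification model, and the names `identify_token`, `is_cs_candidate`, `filter_candidates`, and the `min_share` threshold are assumptions made for illustration. The idea shown is only the general one: flag a sentence as a CS candidate when both Basque and Spanish account for a meaningful share of its identified tokens, then hand the candidates to human validators.

```python
import re
from collections import Counter

# Toy lexicons: a hypothetical stand-in for a real language identification model.
BASQUE_WORDS = {"eta", "da", "dut", "bat", "ez", "bai", "etxera", "noa",
                "kaixo", "gaur", "oso", "ondo", "eskerrik", "asko"}
SPANISH_WORDS = {"y", "es", "que", "la", "el", "pero", "luego", "te",
                 "llamo", "muy", "bien", "hoy", "tengo", "gracias"}


def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def identify_token(token):
    """Return 'eu', 'es', or 'unk' for a single token (toy lookup)."""
    if token in BASQUE_WORDS:
        return "eu"
    if token in SPANISH_WORDS:
        return "es"
    return "unk"


def is_cs_candidate(sentence, min_share=0.2):
    """Flag a sentence when the minority language covers at least
    `min_share` of the tokens that were identified as Basque or Spanish."""
    counts = Counter(identify_token(t) for t in tokenize(sentence))
    known = counts["eu"] + counts["es"]
    if known == 0:
        return False
    return min(counts["eu"], counts["es"]) / known >= min_share


def filter_candidates(sentences, min_share=0.2):
    """Keep only the sentences flagged as CS candidates, in input order."""
    return [s for s in sentences if is_cs_candidate(s, min_share)]


if __name__ == "__main__":
    corpus = [
        "Etxera noa, luego te llamo",   # mixed Basque and Spanish
        "Hoy tengo que trabajar",       # Spanish only
        "Kaixo, gaur oso ondo nago",    # Basque only
    ]
    for sentence in filter_candidates(corpus):
        print(sentence)  # candidates would then go to manual validation
```

In this toy run only the first sentence is kept. An automatic filter of this kind will produce false positives (shared vocabulary, named entities, borrowings), which is why the abstract pairs it with manual validation before an instance is accepted into the corpus.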
Problem

Research questions and friction points this paper is trying to address.

Lack of Basque-Spanish code-switching data
Need for NLP models that understand and generate CS
Development of a naturally sourced corpus
Innovation

Methods, ideas, or system contributions that make the work stand out.

Naturally sourced corpus creation
Language identification model utilization
Manual validation for reliability
Authors
Maite Heredia
PhD student, IXA, EHU
Jeremy Barnes
HiTZ Center - Ixa, University of the Basque Country UPV/EHU
A. Soroa
HiTZ Center - Ixa, University of the Basque Country UPV/EHU