Quechua Speech Datasets in Common Voice: The Case of Puno Quechua

📅 2025-10-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Quechua languages are under-resourced, and a severe scarcity of speech data hinders the development of speech technologies for them. Method: Focusing on Puno Quechua (ISO 639-3: qxp), this work describes an open, community-driven, ethics-first approach to speech data curation on the Common Voice platform. It integrates read and spontaneous speech collection, supports multilingual parallel recording, and implements Indigenous-led community validation, with an emphasis on data sovereignty and cultural appropriateness. Contribution/Results: Common Voice now covers 17 Quechua varieties and hosts 191.1 hours of Quechua speech (86% validated), including 12 hours of Puno Quechua (77% validated). This improves data availability for low-resource Indigenous languages and offers a replicable approach to inclusive, multilingual speech technology development.

📝 Abstract

Under-resourced languages, such as the Quechua languages, face data and resource scarcity, which hinders the development of speech technology for them. To address this issue, Common Voice presents a crucial opportunity to foster open, community-driven speech dataset creation. This paper examines the integration of Quechua languages into Common Voice. We describe the 17 Quechua languages currently covered and present Puno Quechua (ISO 639-3: qxp) as a focused case study that includes language onboarding and the collection of both read and spontaneous speech data. Our results show that Common Voice now hosts 191.1 hours of Quechua speech (86% validated), with Puno Quechua contributing 12 hours (77% validated), highlighting Common Voice's potential. We further propose a research agenda that addresses technical challenges, alongside ethical considerations for community engagement and Indigenous data sovereignty. Our work contributes towards inclusive voice technology and the digital empowerment of under-resourced language communities.
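
The percentages in the abstract can be turned into approximate validated-hour figures. The sketch below is only illustrative arithmetic: it assumes each validation percentage applies to the total recorded hours quoted next to it, and it uses the rounded figures from the abstract, so the results are estimates rather than numbers reported by the paper. The function and variable names are hypothetical.

```python
def validated_hours(total_hours: float, validated_fraction: float) -> float:
    """Approximate validated hours from a total and a validated share."""
    return total_hours * validated_fraction


# Figures quoted in the abstract (rounded as reported there).
# Assumption: each percentage is a share of the total hours beside it.
ALL_QUECHUA = (191.1, 0.86)   # (total hours, validated fraction)
PUNO_QUECHUA = (12.0, 0.77)

all_validated = validated_hours(*ALL_QUECHUA)
puno_validated = validated_hours(*PUNO_QUECHUA)

# Share of all recorded Quechua hours that come from Puno Quechua.
puno_share = PUNO_QUECHUA[0] / ALL_QUECHUA[0]

print(f"All Quechua: ~{all_validated:.1f} validated hours")
print(f"Puno Quechua: ~{puno_validated:.1f} validated hours")
print(f"Puno share of recorded hours: ~{puno_share:.1%}")
```

Under these assumptions, the estimate comes to roughly 164 validated hours across all Quechua languages and roughly 9 for Puno Quechua, which accounts for about 6% of the recorded total.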

Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity for under-resourced Quechua languages

Integrating Quechua speech datasets into the Common Voice platform

Developing inclusive voice technology with ethical community engagement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrating Quechua languages into the Common Voice platform

Collecting both read and spontaneous speech data

Addressing technical challenges and ethical considerations
