How I Built ASR for Endangered Languages with a Spoken Dictionary

📅 2025-10-06

🤖 AI Summary

This work addresses the challenge of building automatic speech recognition (ASR) systems for critically endangered languages—such as Manx Gaelic and Cornish—that lack sentence-level annotated speech corpora. We propose a novel paradigm that substitutes conventional conversational speech data with spoken pronunciation dictionaries. Methodologically, we employ an end-to-end ASR architecture enhanced by dictionary-based forced alignment and acoustic model fine-tuning, specifically optimized for short, non-contiguous, and low-volume audio inputs. Experiments demonstrate that a functional ASR system can be constructed using only ~40 minutes of pronunciation dictionary audio—without requiring manual text transcriptions or time-aligned annotations—achieving word error rates below 50% on both languages. This approach substantially lowers the data requirements for endangered-language ASR and constitutes the first empirical validation of pronunciation dictionaries as sole training data in ultra-low-resource settings. It provides a scalable technical pathway for ASR development across thousands of under-resourced languages lacking spoken corpora.

📝 Abstract

Nearly half of the world's languages are endangered. Speech technologies such as Automatic Speech Recognition (ASR) are central to revival efforts, yet most languages remain unsupported because standard pipelines expect utterance-level supervised data. Speech data often exist for endangered languages but rarely match these formats. Manx Gaelic ($\sim$2,200 speakers), for example, has had transcribed speech since 1948, yet remains unsupported by modern systems. In this paper, we explore how little data, and in what form, is needed to build ASR for critically endangered languages. We show that a short-form pronunciation resource is a viable alternative, and that 40 minutes of such data produces usable ASR for Manx ($<$50% WER). We replicate our approach, applying it to Cornish ($\sim$600 speakers), another critically endangered language. Results show that the barrier to entry, in quantity and form, is far lower than previously thought, giving hope to endangered language communities that cannot afford to meet the requirements arbitrarily imposed upon them.

Problem

Research questions and friction points this paper is trying to address.

Developing ASR for endangered languages with limited data

Using short-form pronunciation resources as alternative training data

Reducing barriers to entry for critically endangered language communities

Innovation

Methods, ideas, or system contributions that make the work stand out.

ASR trained on short-form pronunciation dictionary data

40 minutes of such data achieves under 50% WER for Manx

Method replicated for Cornish
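
The results above are reported as word error rate (WER). As a point of reference, WER is conventionally the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the paper's evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length.

    Illustrative only: splits on whitespace and applies no text
    normalization, which real evaluations usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Placeholder tokens, not real Manx or Cornish text:
# one substitution (b -> x) and one deletion (d) over four reference words.
print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a WER below 50% means fewer than one word-level error for every two reference words. WER can exceed 100% when the output contains many inserted words.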
