Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages

📅 2024-09-13
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Automatic speech recognition (ASR) for extremely low-resource endangered languages—specifically Amis and Seediq—is severely hindered by acute training data scarcity. Method: We investigate the relationship between data scale and ASR performance under continual self-supervised pretraining, proposing a novel multilingual data selection strategy. This approach jointly employs a language classifier and a one-class classifier to identify phonetically proximate cross-lingual speech segments, thereby enhancing the relevance and utility of target-language data. The method integrates self-supervised learning, multilingual representation learning, and phonetic embedding similarity ranking. Contribution/Results: Experiments demonstrate substantial ASR performance gains under extremely low-resource conditions, validating the efficacy and feasibility of cross-lingual transfer-based data augmentation for endangered language recognition. Our framework establishes a scalable, generalizable paradigm for modeling languages with minimal or no dedicated speech resources.

📝 Abstract
This study investigates the efficacy of data augmentation techniques for low-resource automatic speech recognition (ASR), focusing on two endangered Austronesian languages, Amis and Seediq. Recognizing the potential of self-supervised learning (SSL) in low-resource settings, we explore the impact of data volume on the continued pre-training of SSL models. We propose a novel data-selection scheme leveraging a multilingual corpus to augment the limited target language data. This scheme utilizes a language classifier to extract utterance embeddings and employs one-class classifiers to identify utterances phonetically and phonologically proximate to the target languages. Utterances are ranked and selected based on their decision scores, ensuring the inclusion of highly relevant data in the SSL-ASR pipeline. Our experimental results demonstrate the effectiveness of this approach, yielding substantial improvements in ASR performance for both Amis and Seediq. These findings underscore the feasibility and promise of data augmentation through cross-lingual transfer learning for low-resource language ASR.
Problem

Research questions and friction points this paper is trying to address.

Improving ASR for low-resource endangered languages
Investigating data volume impact on self-supervised learning
Developing data-selection methods using cross-lingual transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed novel data-selection scheme using multilingual corpus
Utilized language classifier and one-class classifiers for selection
Ranked utterances by phonetic and phonological proximity scores
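The selection steps above can be sketched in code. This is a minimal illustrative example, not the paper's implementation: it assumes utterance embeddings have already been extracted by a language classifier, and uses a one-class SVM as a stand-in for the paper's one-class classifier. The function name `select_utterances` and all data are hypothetical.

```python
# Hypothetical sketch: rank multilingual pool utterances by proximity
# to a target language, using a one-class classifier over embeddings.
import numpy as np
from sklearn.svm import OneClassSVM

def select_utterances(target_emb, pool_emb, top_k):
    """Fit a one-class model on target-language embeddings and rank
    pool utterances by decision score (higher = more target-like)."""
    ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
    ocsvm.fit(target_emb)                       # learn the target-language region
    scores = ocsvm.decision_function(pool_emb)  # proximity score per utterance
    order = np.argsort(-scores)                 # rank descending by score
    return order[:top_k], scores

# Synthetic stand-ins for utterance embeddings (8-dim for brevity).
rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(50, 8))   # target-language utterances
near = rng.normal(0.0, 1.0, size=(20, 8))     # pool utterances close to target
far = rng.normal(5.0, 1.0, size=(20, 8))      # pool utterances far from target
pool = np.vstack([near, far])                 # indices 0-19 near, 20-39 far

selected, scores = select_utterances(target, pool, top_k=10)
print(selected)
```

The selected indices feed the highest-scoring cross-lingual utterances into continued SSL pre-training, which is the augmentation step the abstract describes.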