SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance degradation of foundation models on low-resource Arabic-English code-switched (CS) speech recognition caused by data scarcity, this paper proposes SAGE, a modified audio-splicing method for generating artificial CS speech data, together with an Experience-Replay-inspired incremental fine-tuning strategy. Fine-tuning a self-supervised speech model on SAGE data yields an absolute Word Error Rate (WER) improvement of 7.8% on Arabic-English CS benchmarks, while the replay mechanism mitigates catastrophic forgetting across dialectal Arabic and CS speech. Combined with out-of-domain 3-gram language model fusion and few-shot adaptation, the approach reaches a WER of 31.1% on Arabic-English CS benchmarks, surpassing USM and Whisper-large-v2 (both over ten times larger) by absolute margins of 5.5% and 8.4%, respectively.
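The core of the data-generation idea is splicing monolingual speech segments into artificial code-switched utterances. Below is a minimal, hypothetical sketch of segment concatenation with a short crossfade to soften splice boundaries; the function name, sample rate, and 10 ms crossfade length are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def splice_segments(segments, sr=16000, xfade_ms=10):
    """Concatenate speech segments (mono float arrays at sample rate sr)
    with a short linear crossfade to reduce boundary artifacts."""
    n_fade = int(sr * xfade_ms / 1000)
    out = segments[0].astype(np.float32)
    for seg in segments[1:]:
        seg = seg.astype(np.float32)
        # Overlap-add the last n_fade samples of `out` with the first
        # n_fade samples of the next segment, ramping gains linearly.
        fade_out = out[-n_fade:] * np.linspace(1.0, 0.0, n_fade, dtype=np.float32)
        fade_in = seg[:n_fade] * np.linspace(0.0, 1.0, n_fade, dtype=np.float32)
        out = np.concatenate([out[:-n_fade], fade_out + fade_in, seg[n_fade:]])
    return out

# Stand-ins for an Arabic and an English segment (pure tones here,
# purely for illustration); real usage would load aligned speech clips.
ar = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
en = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
cs = splice_segments([ar, en])
```

Each splice shortens the output by one crossfade window, so two 1-second segments yield slightly under 2 seconds of spliced audio.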

📝 Abstract
This paper investigates the performance of various speech SSL models on dialectal Arabic (DA) and Arabic-English code-switched (CS) speech. To address data scarcity, a modified audio-splicing approach is introduced to generate artificial CS speech data. Fine-tuning an already fine-tuned SSL model with the proposed Spliced-Audio Generated (SAGE) data results in an absolute improvement on Word Error Rate (WER) of 7.8% on Arabic and English CS benchmarks. Additionally, an Experience Replay (ER) inspired approach is proposed to enhance generalisation across DA and CS speech while mitigating catastrophic forgetting. Integrating an out-of-domain 3-gram language model reduces the overall mean WER from 31.7% to 26.6%. Few-shot fine-tuning for code-switching benchmarks further improves WER by 4.9%. A WER of 31.1% on Arabic-English CS benchmarks surpasses large-scale multilingual models, including USM and Whisper-large-v2 (both over ten times larger) by an absolute margin of 5.5% and 8.4%, respectively.
Problem

Research questions and friction points this paper is trying to address.

Enhancing speech recognition for low-resource Arabic-English code-switched speech
Generating artificial code-switched data to address scarcity
Improving model performance while preventing catastrophic forgetting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modified audio-splicing generates artificial CS speech
Experience Replay enhances DA and CS generalization
Out-of-domain 3-gram LM reduces WER significantly
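The Experience-Replay idea above amounts to mixing a fraction of earlier-domain examples into each fine-tuning batch so the model does not forget dialectal Arabic while adapting to code-switched data. A minimal sketch of such a batch sampler follows; the batch size and replay ratio are assumed values for illustration, not the paper's settings.

```python
import random

def replay_batches(new_data, old_data, batch_size=8, replay_ratio=0.25, seed=0):
    """Yield fine-tuning batches that mix `replay_ratio` of old-domain
    examples into every batch to mitigate catastrophic forgetting."""
    rng = random.Random(seed)
    n_replay = max(1, int(batch_size * replay_ratio))
    n_new = batch_size - n_replay
    for i in range(0, len(new_data), n_new):
        # Fresh code-switched examples plus a replayed sample of old data.
        batch = new_data[i:i + n_new] + rng.sample(old_data, n_replay)
        rng.shuffle(batch)
        yield batch

batches = list(replay_batches(list(range(16)), list(range(100, 200))))
```

With a replay ratio of 0.25 and batch size 8, every batch carries two old-domain examples, so gradient updates never see the new domain in isolation.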