🤖 AI Summary
This study investigates the noise robustness of the Whisper model for Javanese and Sundanese, two under-resourced Indonesian regional languages, addressing a critical gap in automatic speech recognition (ASR) evaluation under realistic noisy conditions. To tackle data scarcity and dialectal variability, we propose a noise-aware training framework that integrates multi-SNR synthetic noise augmentation, SpecAugment, and cross-domain data utilization. Experimental results show that noise-aware fine-tuning reduces word error rate (WER) by 28.6% on average across both languages for large-scale Whisper models (e.g., whisper-large), significantly outperforming standard baselines. Error analysis identifies phonological ambiguity and dialectal variation as the primary sources of degradation. To our knowledge, this is the first systematic validation that the noise robustness of large pre-trained ASR models can be effectively transferred to low-resource Indonesian regional languages. The proposed framework establishes a reusable, robust training paradigm for low-resource speech recognition, with implications for real-world deployment in acoustically challenging environments.
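Since the headline result is stated in terms of WER, a minimal sketch of the metric may help: WER is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the paper, and the paper's text normalization (casing, punctuation handling) is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (hypothetical) transcripts: one substitution and one deletion
# against a four-word reference.
example_wer = word_error_rate("aku arep mangan sega", "aku arep turu")
```

Because WER is normalized by the reference length, it can exceed 1.0 when the hypothesis contains many insertions.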
📝 Abstract
We investigate the robustness of Whisper-based automatic speech recognition (ASR) models for two major Indonesian regional languages: Javanese and Sundanese. While recent work has demonstrated strong ASR performance under clean conditions, the effectiveness of these models in noisy environments remains unclear. To address this, we experiment with multiple training strategies, including synthetic noise augmentation and SpecAugment, and evaluate performance across a range of signal-to-noise ratios (SNRs). Our results show that noise-aware training substantially improves robustness, particularly for larger Whisper models. A detailed error analysis further reveals language-specific challenges, highlighting avenues for future improvements.
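Synthetic noise augmentation at a chosen SNR can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the SNR levels, and the toy signals are assumptions, and it uses plain Python lists rather than an audio library. The noise is scaled so that the ratio of speech power to scaled-noise power matches the target SNR in decibels, then added to the speech.

```python
import math
import random


def mean_power(samples):
    """Average power (mean squared amplitude) of a signal."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the mixture has the requested SNR in dB."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop or truncate the noise clip to match the speech length.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech = mean_power(speech)
    p_noise = mean_power(tiled)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must have non-zero power")
    # Solve 10 * log10(p_speech / (scale**2 * p_noise)) = snr_db for scale.
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, tiled)]


def measured_snr_db(speech, mixed):
    """SNR of a mixture, given the clean speech it was built from."""
    residual = [m - s for m, s in zip(mixed, speech)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


# Toy signals standing in for real audio: a 220 Hz tone sampled at 16 kHz
# and Gaussian noise. The SNR levels below are hypothetical.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]
augmented = {snr: mix_at_snr(speech, noise, snr) for snr in (0, 5, 10, 20)}
```

During training, one would typically sample an SNR level per utterance so the model sees a spread of noise conditions; lower SNR values correspond to louder noise relative to the speech.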