🤖 AI Summary
This study addresses the poor out-of-distribution (OOD) generalization of automatic speech recognition (ASR) systems for Romanian, a low-resource language, by introducing RO-N3WS, a benchmark dataset of over 126 hours of transcribed speech drawn from broadcast news, literary audiobooks, film dialogue, children's stories, and conversational podcasts. The work evaluates state-of-the-art models (Whisper and Wav2Vec 2.0) in both zero-shot and fine-tuned settings and runs controlled comparisons against synthetic speech generated with expressive TTS models. Fine-tuning on even a limited amount of real RO-N3WS data yields substantial word error rate (WER) improvements over zero-shot baselines. These findings point to the role of data diversity in OOD generalization for low-resource ASR, and the authors plan to release models, scripts, and data splits to support reproducible research in multilingual ASR.
📝 Abstract
We introduce RO-N3WS, a benchmark Romanian speech dataset designed to improve generalization in automatic speech recognition (ASR), particularly in low-resource and out-of-distribution (OOD) conditions. RO-N3WS comprises over 126 hours of transcribed audio collected from broadcast news, literary audiobooks, film dialogue, children's stories, and conversational podcast speech. This diversity enables robust training and fine-tuning across stylistically distinct domains. We evaluate several state-of-the-art ASR systems (Whisper, Wav2Vec 2.0) in both zero-shot and fine-tuned settings, and conduct controlled comparisons using synthetic data generated with expressive TTS models. Our results show that even limited fine-tuning on real speech from RO-N3WS yields substantial WER improvements over zero-shot baselines. We will release all models, scripts, and data splits to support reproducible research in multilingual ASR, domain adaptation, and lightweight deployment.
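The abstract reports its results as WER. For readers unfamiliar with the metric, the following is a minimal sketch of the standard word-level definition: word edit distance between reference and hypothesis, divided by the number of reference words. It is not the authors' evaluation script. The function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made for illustration, and the text normalization used in the paper is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is a plain whitespace split, with no
    casing, punctuation, or diacritic normalization (all assumptions).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from RO-N3WS.
    print(word_error_rate("ana are mere", "ana are pere"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.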