🤖 AI Summary
This study addresses the challenges of Arabic text-to-speech (TTS) synthesis, particularly the scarcity of publicly available datasets and inaccuracies in automatic diacritization. To overcome these limitations, the authors develop an automated pipeline integrating voice activity detection, automatic speech recognition, diacritic prediction, and noise filtering to process approximately 4,000 hours of Arabic speech, yielding a high-quality training corpus. Using this data, they train end-to-end TTS models at multiple scales. Experimental results demonstrate that while models trained on diacritized text achieve superior performance, leveraging large-scale undiacritized data significantly narrows the quality gap. This work substantially reduces reliance on manual annotation, and the authors plan to release an open-source Arabic TTS model capable of operating without explicit diacritics.
📝 Abstract
Arabic Text-to-Speech (TTS) research has been hindered by the scarcity of both publicly available training data and accurate Arabic diacritization models. In this paper, we address these limitations by exploring Arabic TTS training on large, automatically annotated data. Specifically, we built a robust pipeline for collecting Arabic recordings and processing them automatically using voice activity detection, speech recognition, automatic diacritization, and noise filtering, yielding around 4,000 hours of Arabic TTS training data. We then trained several robust TTS models with voice cloning on varying amounts of data, namely 100, 1,000, and 4,000 hours, each with and without diacritization. We show that although models trained on diacritized data are generally better, larger amounts of training data compensate for the lack of diacritics to a significant degree. We plan to release a public Arabic TTS model that works without the need for diacritization.
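The data-collection pipeline described above can be pictured as a sequence of filtering stages applied to each recording. The sketch below is purely illustrative: the function names, score fields, and thresholds are hypothetical placeholders, not the authors' actual implementation, and the VAD/ASR/diacritization stages are reduced to threshold checks on precomputed scores.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    audio_id: str
    duration_s: float
    snr_db: float          # hypothetical signal-to-noise estimate (noise filtering)
    asr_confidence: float  # hypothetical ASR confidence for the transcript

def run_pipeline(clips, min_snr_db=20.0, min_asr_conf=0.9):
    """Illustrative sketch: keep only clips that pass noise and ASR checks.

    In the paper's pipeline, recordings are segmented with voice activity
    detection, transcribed with ASR, automatically diacritized, and filtered
    for noise; here those stages are stand-in threshold tests.
    """
    kept = []
    total_hours = 0.0
    for clip in clips:
        if clip.snr_db < min_snr_db:            # noise filtering
            continue
        if clip.asr_confidence < min_asr_conf:  # unreliable transcript
            continue
        kept.append(clip)
        total_hours += clip.duration_s / 3600.0
    return kept, total_hours

clips = [
    Clip("a", 3600.0, snr_db=25.0, asr_confidence=0.95),
    Clip("b", 1800.0, snr_db=10.0, asr_confidence=0.95),  # too noisy: dropped
    Clip("c", 7200.0, snr_db=30.0, asr_confidence=0.50),  # low ASR confidence: dropped
]
kept, hours = run_pipeline(clips)
print(len(kept), hours)  # 1 1.0
```

At the paper's scale, the same per-clip gating would be run over the raw crawl until roughly 4,000 hours of clean, transcribed (and optionally diacritized) speech remain.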