AI Summary
Current expressive text-to-speech (TTS) synthesis is hindered by the scarcity of open-source datasets featuring non-linguistic vocalizations (e.g., laughter, coughing) and fine-grained emotional annotations. To address this, we introduce the first large-scale, text-aligned, publicly available English corpus of non-linguistic vocalizations: 17 hours of audio covering 10 vocalization types and annotated with 8 emotion categories. We propose a data curation pipeline that integrates automatic speech recognition (ASR), non-linguistic sound event detection, multi-model emotion classification, and human verification. Using this corpus, we fine-tune open-source TTS models. Experiments show that the fine-tuned models reach naturalness and non-linguistic vocalization fidelity comparable to state-of-the-art proprietary systems such as CosyVoice2, and the released corpus expands the open data available for expressive TTS.
Abstract
Current expressive speech synthesis models are constrained by the limited availability of open-source datasets containing diverse nonverbal vocalizations (NVs). In this work, we introduce NonverbalTTS (NVTTS), a 17-hour open-access dataset annotated with 10 types of NVs (e.g., laughter, coughs) and 8 emotional categories. The dataset is derived from popular sources, VoxCeleb and Expresso, using automated detection followed by human validation. We propose a comprehensive pipeline that integrates automatic speech recognition (ASR), NV tagging, emotion classification, and a fusion algorithm to merge transcriptions from multiple annotators. Fine-tuning open-source text-to-speech (TTS) models on the NVTTS dataset achieves parity with closed-source systems such as CosyVoice2, as measured by both human evaluation and automatic metrics, including speaker similarity and NV fidelity. By releasing NVTTS and its accompanying annotation guidelines, we address a key bottleneck in expressive TTS research. The dataset is available at https://huggingface.co/datasets/deepvk/NonverbalTTS.
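The abstract does not specify how the fusion algorithm merges transcriptions from multiple annotators. As a rough illustration only, the sketch below assumes a simple majority vote over NV tag placements: every annotator works from the same ASR word sequence, NV tags are written as bracketed tokens such as `[laugh]`, and a tag is kept only when more than half of the annotators put it at the same word position. The tag format, the function names `split_annotation` and `fuse_annotations`, and the voting rule are all hypothetical and are not taken from the NVTTS release.

```python
from collections import Counter


def split_annotation(text):
    """Split an annotated transcript into its words and its NV tag placements.

    A tag placement is a (position, tag) pair, where position is the number of
    words that precede the tag. Bracketed tokens are treated as NV tags
    (hypothetical format, assumed for this sketch).
    """
    words, tags = [], set()
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            tags.add((len(words), token))
        else:
            words.append(token)
    return words, tags


def fuse_annotations(annotations, min_votes=None):
    """Merge annotated transcripts by majority vote on NV tag placements."""
    parsed = [split_annotation(a) for a in annotations]
    words = parsed[0][0]
    # This sketch assumes all annotators tagged the same ASR word sequence.
    if any(w != words for w, _ in parsed):
        raise ValueError("annotators must share the same word sequence")
    if min_votes is None:
        min_votes = len(annotations) // 2 + 1  # strict majority
    # Each annotator contributes at most one vote per (position, tag) pair.
    votes = Counter(tag for _, tags in parsed for tag in tags)
    kept = sorted(t for t, n in votes.items() if n >= min_votes)
    fused = []
    for i in range(len(words) + 1):
        fused.extend(tag for pos, tag in kept if pos == i)
        if i < len(words):
            fused.append(words[i])
    return " ".join(fused)


example = [
    "well [laugh] I think so",
    "well [laugh] I think so [cough]",
    "well I think so",
]
fused_example = fuse_annotations(example)
```

In this example `[laugh]` is placed after the first word by two of the three annotators and is kept, while `[cough]` has a single vote and is dropped. A real pipeline would also need to align transcripts whose words differ, which this sketch deliberately rejects with an error.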