NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech

📅 2025-07-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current expressive text-to-speech (TTS) synthesis is hindered by the scarcity of open-source datasets featuring non-linguistic vocalizations (e.g., laughter, coughing) and fine-grained emotional annotations. To address this, we introduce the first large-scale, text-aligned, publicly available English corpus of non-linguistic vocalizations: 17 hours of audio covering 10 vocalization types, annotated with eight emotion categories. We propose a robust data curation pipeline that integrates automatic speech recognition (ASR), non-linguistic sound event detection, multi-model emotion classification, and human verification. Leveraging this corpus, we fine-tune open-source TTS models. Experiments demonstrate that our approach achieves naturalness and non-linguistic vocalization fidelity comparable to state-of-the-art proprietary systems such as CosyVoice2, substantially strengthening the data foundation and modeling capabilities for expressive TTS.

๐Ÿ“ Abstract
Current expressive speech synthesis models are constrained by the limited availability of open-source datasets containing diverse nonverbal vocalizations (NVs). In this work, we introduce NonverbalTTS (NVTTS), a 17-hour open-access dataset annotated with 10 types of NVs (e.g., laughter, coughs) and 8 emotional categories. The dataset is derived from two popular sources, VoxCeleb and Expresso, using automated detection followed by human validation. We propose a comprehensive pipeline that integrates automatic speech recognition (ASR), NV tagging, emotion classification, and a fusion algorithm to merge transcriptions from multiple annotators. Fine-tuning open-source text-to-speech (TTS) models on the NVTTS dataset achieves parity with closed-source systems such as CosyVoice2, as measured by both human evaluation and automatic metrics, including speaker similarity and NV fidelity. By releasing NVTTS and its accompanying annotation guidelines, we address a key bottleneck in expressive TTS research. The dataset is available at https://huggingface.co/datasets/deepvk/NonverbalTTS.
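The abstract describes utterances that carry a text-aligned transcription with inline NV tags plus an emotion label. A minimal sketch of working with such samples is shown below; the field names and inline `[tag]` convention are illustrative assumptions, not the published NVTTS schema.

```python
# Sketch of filtering NV-annotated samples, assuming a hypothetical
# schema with inline NV tags in the transcript and a per-utterance
# emotion label. Field names here are assumptions for illustration.
import re

samples = [
    {"text": "that is hilarious [laughter] stop it", "emotion": "happy"},
    {"text": "sorry [cough] where was I", "emotion": "neutral"},
    {"text": "oh no [sigh] not again", "emotion": "sad"},
]

NV_TAG = re.compile(r"\[([a-z]+)\]")

def nv_types(sample):
    """Extract the inline nonverbal-vocalization tags from a transcript."""
    return NV_TAG.findall(sample["text"])

# Filter the corpus for utterances containing laughter.
laughter = [s for s in samples if "laughter" in nv_types(s)]
```

A schema like this keeps NV events aligned with their position in the text, which is what lets a TTS model learn where to emit the vocalization during synthesis.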
Problem

Research questions and friction points this paper is trying to address.

Lack of open-source datasets for diverse nonverbal vocalizations in TTS
Need for emotion-annotated NV datasets to enhance expressive speech synthesis
Absence of comprehensive pipelines integrating ASR, NV tagging, and emotion classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated NV detection with human validation
Pipeline integrates ASR, NV tagging, emotion classification
Fine-tunes TTS models for NV and emotion fidelity
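The pipeline's final step merges transcriptions from multiple annotators. The paper's exact fusion algorithm is not detailed here; the sketch below is a hedged stand-in showing one common approach, a majority vote over the NV tags each annotator proposed for an utterance.

```python
# Illustrative transcription-fusion step: keep an NV tag only when a
# majority of annotators marked it for the utterance. This is a
# stand-in for the paper's fusion algorithm, not a reproduction of it.
from collections import Counter

def fuse_nv_tags(annotations, min_votes=2):
    """annotations: list of per-annotator NV-tag lists for one utterance.

    Each annotator's tags are deduplicated before voting so that a
    single annotator cannot outvote the others by repeating a tag.
    """
    votes = Counter(tag for tags in annotations for tag in set(tags))
    return sorted(tag for tag, n in votes.items() if n >= min_votes)

# Three annotators; only "laughter" reaches the 2-vote threshold.
fused = fuse_nv_tags([["laughter"], ["laughter", "cough"], ["laughter"]])
```

Majority voting is a simple consensus rule; the actual NVTTS fusion also has to reconcile the ASR text and tag positions, which this sketch omits.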