ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5

📅 2024-09-27
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Problem: High-quality annotated speech data for automatic speech recognition (ASR) and speaker verification (SV) in Mandarin-speaking children aged 3–5 years is severely lacking. Method: We introduce the first large-scale, open-source Mandarin child speech dataset, comprising 41.25 hours of manually transcribed audio from 397 children, balanced by gender, across multiple provinces in China. It is the first resource enabling joint ASR and SV modeling for this age group, and it includes a comprehensive analysis of speaker demographics and geographic distribution. Contribution/Results: Using this dataset, we train Conformer models from scratch and fine-tune state-of-the-art self-supervised (HuBERT) and multilingual (Whisper) models. Fine-tuned models reduce character error rate (CER) by over 30% compared to from-scratch training. The dataset supports both ASR, measured by CER, and SV, measured by equal error rate (EER), which validates its efficacy and practical utility.
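The summary reports results as character error rate (CER). As a minimal sketch of how that metric is usually computed (not the authors' evaluation code), the snippet below takes the character-level edit distance between reference and hypothesis and divides by the number of reference characters. The function names and the example sentences are illustrative assumptions. The summary does not say whether the 30% reduction is relative or absolute; `relative_reduction` shows the relative reading as one possibility.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, keeping one DP row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total character edits / total reference characters."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        # Spaces are dropped so that only characters are scored (an assumption;
        # real pipelines may also normalize punctuation and digits).
        ref_chars = list(ref.replace(" ", ""))
        hyp_chars = list(hyp.replace(" ", ""))
        total_edits += edit_distance(ref_chars, hyp_chars)
        total_chars += len(ref_chars)
    return total_edits / total_chars


def relative_reduction(baseline, improved):
    """Relative error reduction, e.g. 0.20 -> 0.10 gives 0.5 (50%)."""
    return (baseline - improved) / baseline


# Illustrative example, not taken from the dataset: one substituted character
# out of five reference characters.
example_cer = cer(["我想吃苹果"], ["我想吃平果"])
```

Computing CER at the corpus level, rather than averaging per-utterance rates, keeps short utterances from dominating the score, which matters for the brief utterances typical of young children.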

📝 Abstract
Automatic speech recognition (ASR) systems have advanced significantly with models like Whisper, Conformer, and self-supervised frameworks such as Wav2vec 2.0 and HuBERT. However, developing robust ASR models for young children's speech remains challenging due to differences in pronunciation, tone, and pace compared to adult speech. In this paper, we introduce a new Mandarin speech dataset focused on children aged 3 to 5, addressing the scarcity of resources in this area. The dataset comprises 41.25 hours of speech with carefully crafted manual transcriptions, collected from 397 speakers across various provinces in China, with balanced gender representation. We provide a comprehensive analysis of speaker demographics, speech duration distribution and geographic coverage. Additionally, we evaluate ASR performance on models trained from scratch, such as Conformer, as well as fine-tuned pre-trained models like HuBERT and Whisper, where fine-tuning demonstrates significant performance improvements. Furthermore, we assess speaker verification (SV) on our dataset, showing that, despite the challenges posed by the unique vocal characteristics of young children, the dataset effectively supports both ASR and SV tasks. This dataset is a valuable contribution to Mandarin child speech research. The dataset is now open-source and freely available for all academic purposes on https://github.com/flageval-baai/ChildMandarin.
Problem

Research questions and friction points this paper is trying to address.

Develop robust ASR models for young children's Mandarin speech.
Address scarcity of Mandarin speech datasets for children aged 3-5.
Evaluate ASR and speaker verification on child-specific speech data.
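Speaker verification is commonly scored by equal error rate (EER), the operating point where the false acceptance rate equals the false rejection rate. The sketch below approximates EER by sweeping a threshold over the observed trial scores. It is an illustration under stated assumptions, not the paper's evaluation procedure: the function name is hypothetical, scores are assumed to be "higher means more likely the same speaker", and the trial scores in the example are made up.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by trying every observed score as the decision threshold.

    target_scores: scores of same-speaker trials.
    nontarget_scores: scores of different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap = None
    eer = None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest and
            # report their midpoint as the EER estimate.
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Made-up scores: one target trial scores low, one non-target trial scores high.
example_eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
```

With few trials the two error curves rarely cross exactly, which is why the sketch reports the midpoint at the closest threshold; evaluation toolkits typically interpolate instead.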
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a Mandarin speech dataset for children aged 3-5.
Evaluates ASR models such as Conformer, HuBERT, and Whisper.
Supports both ASR and speaker verification tasks.
Authors
Jiaming Zhou
College of Computer Science, Nankai University
Shiyao Wang
College of Computer Science, Nankai University
Shiwan Zhao
Independent Researcher, Research Scientist of IBM Research - China (2000-2020)
AGI, Large Language Model, NLP, Speech, Recommender System
Jiabei He
College of Computer Science, Nankai University
Haoqin Sun
Nankai University
Affective computing, Speech signal processing, Audio understanding
Hui Wang
College of Computer Science, Nankai University
Cheng Liu
College of Computer Science, Nankai University
Aobo Kong
Nankai University
NLP, LLM
Yujie Guo
yujie.guo@ugent.be
low dimensional semiconductors
Yong Qin
College of Computer Science, Nankai University