🤖 AI Summary
Existing child speech recognition models rely heavily on adult speech data, leaving them ill-equipped for the acoustic variability and scarce annotations characteristic of speech from children in early developmental stages or with language impairments. To address this, we propose the first multitask foundation model specifically designed for child speech, integrating phoneme-level prior knowledge through a novel two-stage training framework. We further introduce FASA, a high-accuracy automatic speech alignment tool of our own design that enables noise-robust, fine-grained phoneme-to-audio alignment. On real child speech, FASA improves alignment quality by 13.6× over manual annotation. Leveraging FASA, we construct a high-fidelity child speech dataset that substantially enhances model generalization. Our model attains an average accuracy of 87% across four downstream tasks: automatic speech recognition, pronunciation assessment, language development screening, and phonemic error detection, demonstrating strong efficacy and robustness in clinical and educational applications.
📝 Abstract
With the rapid advancement of conversational and diffusion-based AI, there is growing adoption of AI in educational services, ranging from grading and assessment tools to personalized learning systems that provide targeted support for students. However, this adaptability has yet to fully extend to the domain of children's speech, where existing models often fail due to their reliance on datasets designed for clear, articulate adult speech. Children, particularly those in early developmental stages or with speech and language pathologies, present unique challenges that current AI models and datasets are ill-equipped to handle. To address this, we introduce KidSpeak, a multi-task speech-enhanced foundation model capable of both generative and discriminative tasks, specifically tailored to children's speech patterns. Our framework employs a two-stage training process that incorporates phonetic knowledge into the speech encoder, achieving an average accuracy of 87% across four separate tasks. Furthermore, recognizing the limitations of scalable human annotation and of existing speech alignment tools, we propose the Flexible and Automatic Speech Aligner (FASA) and leverage it to construct high-quality datasets for training and evaluation. This novel alignment tool significantly improves the quality of aligned children's speech extracted from noisy data, enhancing data quality by 13.6× compared to human annotations, as demonstrated on the CHILDES dataset. To the best of our knowledge, KidSpeak and FASA represent the first comprehensive solution designed for speech and language therapy in children, offering both a multi-purpose speech LLM and a robust alignment tool.