KidSpeak: A General Multi-purpose LLM for Kids' Speech Recognition and Screening

📅 2025-11-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing child speech recognition models rely heavily on adult speech data, making them ill-suited to the acoustic variability and scarce annotations characteristic of early-developing or language-impaired children's speech. To address this, we propose the first multitask foundation model designed specifically for child speech, integrating phoneme-level prior knowledge through a novel two-stage training framework. We further introduce FASA, a high-accuracy automatic speech alignment tool that enables noise-robust, fine-grained phoneme-to-audio alignment; on real child speech, FASA improves alignment quality 13.6× over manual annotation. Leveraging FASA, we construct a high-fidelity child speech dataset that substantially improves model generalization. Our model attains an average accuracy of 87% across four downstream tasks (automatic speech recognition, pronunciation assessment, language development screening, and phonemic error detection), demonstrating strong efficacy and robustness in clinical and educational applications.

📝 Abstract
With the rapid advancement of conversational and diffusion-based AI, there is growing adoption of AI in educational services, ranging from grading and assessment tools to personalized learning systems that provide targeted support for students. However, this progress has yet to extend fully to the domain of children's speech, where existing models often fail because they rely on datasets designed for clear, articulate adult speech. Children, particularly those in early developmental stages or with speech and language pathologies, present unique challenges that current AI models and datasets are ill-equipped to handle. To address this, we introduce KidSpeak, a multi-task speech-enhanced Foundation Model capable of both generative and discriminative tasks, tailored specifically to children's speech patterns. Our framework employs a two-stage training process that incorporates phonetic knowledge into the speech encoder, achieving an average accuracy of 87% across four separate tasks. Furthermore, recognizing the limitations of scalable human annotation and existing speech alignment tools, we propose the Flexible and Automatic Speech Aligner (FASA) and leverage it to construct high-quality datasets for training and evaluation. This novel alignment tool significantly improves the quality of aligned children's speech extracted from noisy data, enhancing data quality by 13.6x compared to human annotations, as demonstrated on the CHILDES dataset. To the best of our knowledge, KidSpeak and FASA represent the first comprehensive solution designed for speech and language therapy in children, offering both a multi-purpose speech LLM and a robust alignment tool.
Problem

Research questions and friction points this paper is trying to address.

Develops a speech model for children's unique speech patterns
Creates an alignment tool to improve noisy children's speech data
Addresses gaps in AI for children's speech recognition and therapy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training with phonetic knowledge integration
Flexible Automatic Speech Aligner for noisy children's speech
Multi-task speech-enhanced Foundation Model for children
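The paper itself does not publish FASA's algorithm here, but the phoneme-to-audio alignment it performs is conventionally framed as a monotonic dynamic-programming (Viterbi-style forced alignment) problem: given per-frame phoneme scores from an acoustic model, find the segmentation of frames that best matches the known phoneme sequence. The sketch below is a minimal, hypothetical illustration of that idea, not FASA's actual implementation; the function name and score format are assumptions for the example.

```python
# Illustrative sketch of Viterbi-style forced alignment (NOT FASA's
# actual algorithm). scores[t][p] is a log-score for phoneme p at
# frame t, e.g. from a speech encoder's frame-level posteriors.
def forced_align(scores, phonemes):
    """Align frames to a known phoneme sequence, monotonically.

    Returns, for each frame, the index into `phonemes` it is assigned to.
    """
    T, N = len(scores), len(phonemes)
    NEG = float("-inf")
    # dp[t][i]: best total score covering frames 0..t, currently on phoneme i
    dp = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    dp[0][0] = scores[0][phonemes[0]]
    for t in range(1, T):
        for i in range(N):
            stay = dp[t - 1][i]                      # remain on phoneme i
            move = dp[t - 1][i - 1] if i > 0 else NEG  # advance from i-1
            if stay >= move:
                dp[t][i], back[t][i] = stay, i
            else:
                dp[t][i], back[t][i] = move, i - 1
            if dp[t][i] > NEG:
                dp[t][i] += scores[t][phonemes[i]]
    # Backtrack from the last phoneme at the last frame
    path, i = [N - 1], N - 1
    for t in range(T - 1, 0, -1):
        i = back[t][i]
        path.append(i)
    return path[::-1]
```

For example, with two phonemes and four frames whose scores favor the first phoneme early and the second late, `forced_align([[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]], [0, 1])` returns `[0, 0, 1, 1]`, i.e. the boundary is placed between frames 1 and 2. Noise robustness in a real aligner comes from the quality of the frame scores, not from this search step.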
👥 Authors
Rohan Sharma, Department of Computer Science and Engineering, State University of New York at Buffalo
Dancheng Liu, PhD student, SUNY Buffalo (deep learning, automatic speech recognition, large language models)
Jingchen Sun, Department of Computer Science and Engineering, State University of New York at Buffalo
Shijie Zhou, Department of Computer Science and Engineering, State University of New York at Buffalo
Jiayu Qin, University at Buffalo (machine learning)
Jinjun Xiong, University at Buffalo (AI, Systems, Energy, Design Automation)
Changyou Chen, Department of Computer Science and Engineering, State University of New York at Buffalo