🤖 AI Summary
This study addresses the challenge of automated fluency assessment for children's speech in low-resource languages (Tamil and Malay). We propose a lightweight end-to-end framework: (1) robust speech-to-text transcription using a fine-tuned multilingual ASR model; (2) extraction of objective acoustic metrics, including phonetic and word error rates, speaking rate, and speech-pause duration ratio; and (3) a GPT-based classifier, guided by a small set of human-evaluated examples, that achieves high accuracy with minimal labeled data. To our knowledge, this is the first work to combine large language models with interpretable acoustic metrics for low-resource pediatric speech assessment. Experiments on Tamil and Malay child-speech corpora yield a weighted F1-score of 86.5%, significantly outperforming ChatGPT-4o (+12.3%) and conventional machine-learning baselines (+9.7%). The approach alleviates the dual bottlenecks of scarce annotations and limited model generalizability in low-resource settings.
📝 Abstract
Assessment of children's speaking fluency in education is well researched for majority languages, but remains highly challenging for low-resource languages. This paper proposes a system that automatically assesses fluency by combining a fine-tuned multilingual ASR model, an objective-metrics extraction stage, and a generative pre-trained transformer (GPT) network. The objective metrics include phonetic and word error rates, speech rate, and speech-pause duration ratio. These are interpreted by a GPT-based classifier, guided by a small set of human-evaluated ground-truth examples, to score fluency. We evaluate the proposed system on a dataset of children's speech in two low-resource languages, Tamil and Malay, and compare the classification performance against Random Forest and XGBoost, as well as against ChatGPT-4o predicting fluency directly from speech input. Results demonstrate that the proposed approach achieves significantly higher accuracy than the multimodal GPT and the other baselines.
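As an illustration only (not the authors' code), the two timing-based metrics named above, speech rate and speech-pause duration ratio, could be derived from word-level ASR timestamps. The `WordSegment` structure and the pause-to-speech definition of the ratio are assumptions for this sketch; the paper may define the ratio differently.

```python
from dataclasses import dataclass

@dataclass
class WordSegment:
    """One recognized word with ASR-provided start/end times in seconds."""
    word: str
    start: float
    end: float

def fluency_metrics(segments: list[WordSegment]) -> dict[str, float]:
    """Compute speaking rate (words per minute) and speech-pause duration ratio.

    The ratio is taken here as total pause time divided by total speech time,
    an assumed definition for illustration.
    """
    if not segments:
        return {"speaking_rate_wpm": 0.0, "pause_ratio": 0.0}
    total = segments[-1].end - segments[0].start        # elapsed utterance time
    speech = sum(s.end - s.start for s in segments)     # time spent speaking
    pause = max(total - speech, 0.0)                    # remainder is pause
    return {
        "speaking_rate_wpm": 60.0 * len(segments) / total if total > 0 else 0.0,
        "pause_ratio": pause / speech if speech > 0 else 0.0,
    }

segs = [WordSegment("hello", 0.0, 0.4), WordSegment("world", 0.9, 1.3)]
metrics = fluency_metrics(segs)  # 2 words over 1.3 s, 0.5 s of pause
```

In a pipeline like the one described, such a dictionary of interpretable numbers, together with error rates, would form the prompt context handed to the GPT-based classifier.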