🤖 AI Summary
Assessing isolated-word reading proficiency in children under low-resource conditions remains challenging, particularly when child speech data is scarce and automatic speech recognition (ASR) systems are impractical to deploy.
Method: We propose an ASR-free few-shot evaluation framework that encodes speech with intermediate-layer representations from large self-supervised learning (SSL) models. The approach explores design options such as discretized SSL features and centroid (barycentre) averaging of templates, and combines dynamic time warping with prototypical network principles to score children’s utterances by comparing them against a small set of adult or child reference templates.
Contribution/Results: Idealised experiments on an Afrikaans child speech benchmark show reasonable performance on adult speech but a substantial drop on child speech, even with child-specific templates. Although SSL representations have been successful in other low-resource speech tasks, this work highlights their limitations for few-shot classification of child speech and provides a benchmark analysis for low-resource reading assessment.
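To make the discretised SSL features mentioned in the summary concrete, here is a minimal sketch of one common way to discretise frame-level features: assigning each frame to its nearest entry in a codebook. It is an illustration under assumptions, not the authors' implementation. The codebook is assumed to come from a clustering step such as k-means, the function names `discretise` and `collapse_repeats` are hypothetical, and collapsing consecutive repeats is an optional step that the source does not mention.

```python
import numpy as np


def discretise(features, codebook):
    """Map each frame to the index of its nearest codebook entry.

    features: array of shape (T, D), one D-dimensional vector per frame.
    codebook: array of shape (K, D), e.g. cluster centres (assumed to be
              learned beforehand; how they are obtained is not shown here).
    Returns an integer array of shape (T,).
    """
    # Squared Euclidean distance between every frame and every codebook entry.
    diff = features[:, None, :] - codebook[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    return dist2.argmin(axis=1)


def collapse_repeats(units):
    """Optionally merge runs of identical consecutive units (an assumption)."""
    units = np.asarray(units)
    if units.size == 0:
        return units
    keep = np.concatenate(([True], units[1:] != units[:-1]))
    return units[keep]
```

The resulting unit sequences can then be compared in place of the continuous features, which trades fine acoustic detail for a more compact representation.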
📝 Abstract
We explore an ASR-free method for isolated word reading assessment in low-resource settings. Our few-shot approach compares input child speech to a small set of adult-provided reference templates. Inputs and templates are encoded using intermediate layers from large self-supervised learned (SSL) models. Using an Afrikaans child speech benchmark, we investigate design options such as discretising SSL features and barycentre averaging of the templates. Idealised experiments show reasonable performance for adults, but a substantial drop for child speech input, even with child templates. Despite the success of employing SSL representations in low-resource speech tasks, our work highlights the limitations of SSL representations for processing child data when used in a few-shot classification system.
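The template comparison described in the abstract can be illustrated with a small sketch of nearest-template classification using dynamic time warping (DTW). This is a sketch under stated assumptions rather than the paper's system: the inputs are assumed to be already-extracted feature sequences (for example from an intermediate SSL layer), the frame cost is assumed to be cosine distance, the length normalisation is one of several possible choices, and the names `dtw_distance` and `classify` are hypothetical.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cosine distance between frames of x (n, d) and y (m, d)."""
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    yn = y / (np.linalg.norm(y, axis=1, keepdims=True) + 1e-8)
    return 1.0 - xn @ yn.T


def dtw_distance(x, y):
    """Length-normalised DTW alignment cost between two feature sequences."""
    cost = cosine_cost(x, y)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Standard step pattern: insertion, deletion, or match.
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    return acc[n, m] / (n + m)


def classify(query, templates):
    """Return the (word, distance) of the closest reference template.

    templates: dict mapping a word label to a list of feature arrays,
               i.e. the small set of reference recordings per word.
    """
    best_word, best_dist = None, np.inf
    for word, examples in templates.items():
        for template in examples:
            dist = dtw_distance(query, template)
            if dist < best_dist:
                best_word, best_dist = word, dist
    return best_word, best_dist
```

In a reading assessment setting, the predicted word (or the distance to the templates of the prompted word) would be compared with the word the child was asked to read. How the decision threshold is set is not specified here.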