🤖 AI Summary
This study addresses the challenge of automatically distinguishing childhood pathological stuttering from typical developmental disfluency, a task hindered by their high acoustic similarity and the substantial variability inherent in children’s speech. To overcome these limitations, the authors propose Paediatric-HGNN, a novel pediatric heterogeneous graph neural network that, for the first time, constructs a heterogeneous graph integrating lexical units (word nodes) with frame-level acoustic features (frame nodes). The model incorporates a Context-aware Part-whole Interaction Network (CaPIN) and employs a multi-scale fusion strategy to hierarchically model linguistic and acoustic relationships, capturing developmental “searching” behavior in children’s speech. On the UCLASS and FluencyBank datasets, the approach achieves a weighted accuracy of 82.4% and an F1-score of 0.386 for typical disfluencies, significantly outperforming conventional one-dimensional signal-based methods while enhancing interpretability and robustness.
📝 Abstract
Automated stuttering detection (ASD) systems struggle with paediatric speech due to high acoustic variability in developing voices and the subtle distinction between pathological stuttering and typical developmental disfluencies. We introduce Paediatric-HGNN, a framework using a Context-aware Part-whole Interaction Network (CaPIN) tailored for paediatric data. Instead of conventional 1D signal modelling, our approach builds a heterogeneous graph capturing hierarchical relationships between lexical units (word nodes) and fine-grained acoustic segments (frame nodes). Trained on curated paediatric corpora (UCLASS and FluencyBank), Paediatric-HGNN achieves 82.4% weighted accuracy and a Typical Disfluency F1-score of 0.386. Modelling hierarchical lexical-acoustic interactions captures developmental "searching" behaviour, offering a more robust and interpretable tool for early clinical intervention.
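To make the word-node and frame-node structure concrete, the following is a minimal sketch of how such a heterogeneous graph might be assembled from word-level time alignments. It is not the authors' implementation: the `HeteroGraph` container, the `build_word_frame_graph` helper, the three edge-type names, the 20 ms frame hop, and the aligned word spans are all assumptions made for illustration, and the abstract does not specify any of them.

```python
from dataclasses import dataclass, field


@dataclass
class HeteroGraph:
    """Hypothetical container for a graph with two node types."""
    words: list = field(default_factory=list)    # word node labels
    frames: list = field(default_factory=list)   # frame start times in ms
    edges: dict = field(default_factory=lambda: {
        "word-next-word": [],        # sequential links between words
        "word-contains-frame": [],   # part-whole links (word -> its frames)
        "frame-next-frame": [],      # sequential links between frames of one word
    })


def build_word_frame_graph(word_spans_ms, frame_hop_ms=20):
    """Build the graph from (word, start_ms, end_ms) spans.

    The spans are assumed to come from some word-level alignment;
    the 20 ms hop is an illustrative choice, not a value from the paper.
    """
    graph = HeteroGraph()
    for w_idx, (word, start_ms, end_ms) in enumerate(word_spans_ms):
        graph.words.append(word)
        if w_idx > 0:
            graph.edges["word-next-word"].append((w_idx - 1, w_idx))
        # Each word owns at least one frame node.
        n_frames = max(1, (end_ms - start_ms) // frame_hop_ms)
        for k in range(n_frames):
            f_idx = len(graph.frames)
            graph.frames.append(start_ms + k * frame_hop_ms)
            graph.edges["word-contains-frame"].append((w_idx, f_idx))
            if k > 0:
                graph.edges["frame-next-frame"].append((f_idx - 1, f_idx))
    return graph


# Toy utterance with a repetition on the second word (made-up timings).
spans = [("I", 0, 100), ("w-w-want", 100, 500)]
graph = build_word_frame_graph(spans)
print(len(graph.words), len(graph.frames))
```

In this sketch the `word-contains-frame` edges carry the part-whole relation, so a graph network could pass information between a word and the acoustic frames it spans. A real system would attach acoustic feature vectors to the frame nodes and lexical embeddings to the word nodes.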