🤖 AI Summary
Age and gender classification of children's speech is challenging due to high inter- and intra-speaker variability in pitch, articulation, and vocal tract development, yet the representational mechanisms by which self-supervised speech models (e.g., Wav2Vec2) encode speaker attributes remain poorly understood. This work systematically analyzes layer-wise encoding of speaker characteristics in Wav2Vec2 variants, revealing that early layers capture cues of acoustic individuality, such as age- and gender-correlated phonetic and prosodic features, while deeper layers increasingly encode linguistic content. Leveraging this insight, we propose a hierarchical feature extraction strategy tailored to children's speech. Evaluated on the PFSTAR and CMU Kids datasets, Wav2Vec2-large-lv60 with PCA-based dimensionality reduction achieves 97.14% accuracy for age classification and 98.20% for gender classification on CMU Kids. This study provides the first empirical characterization of how speaker-related features are distributed across SSL model layers, establishing an adaptable modeling paradigm for child-centered perceptual interfaces.
📝 Abstract
Children's speech presents challenges for age and gender classification due to high variability in pitch, articulation, and developmental traits. While self-supervised learning (SSL) models perform well on adult speech tasks, their ability to encode speaker traits in children remains underexplored. This paper presents a detailed layer-wise analysis of four Wav2Vec2 variants using the PFSTAR and CMU Kids datasets. Results show that early layers (1–7) capture speaker-specific cues more effectively than deeper layers, which increasingly focus on linguistic information. Applying PCA further improves classification by reducing redundancy and highlighting the most informative components. The Wav2Vec2-large-lv60 model achieves 97.14% (age) and 98.20% (gender) on CMU Kids; the base-100h and large-lv60 models reach 86.05% and 95.00%, respectively, on PFSTAR. These results reveal how speaker traits are structured across SSL model depth and support more targeted, adaptive strategies for child-aware speech interfaces.
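The pipeline described above (layer-wise embeddings → PCA → classifier) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random matrix `X` stands in for utterance-level embeddings that would in practice be mean-pooled hidden states from an early Wav2Vec2 layer (dimension 768 for base, 1024 for large variants), and the labels and injected class offset are synthetic.

```python
# Sketch of a PCA-then-classify pipeline on placeholder layer-wise features.
# In the real setting, X would hold mean-pooled hidden states from an early
# Wav2Vec2 transformer layer (one row per utterance); here it is synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_utts, dim = 400, 768                      # hypothetical corpus size / base dim
labels = rng.integers(0, 2, size=n_utts)    # synthetic binary (gender) labels
X = rng.normal(size=(n_utts, dim))
X[labels == 1, :10] += 1.5                  # inject a separable speaker cue

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)

# PCA reduces redundancy in the high-dimensional embeddings, keeping the
# most informative components before the downstream classifier.
pca = PCA(n_components=64).fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(pca.transform(X_tr), y_tr)
acc = clf.score(pca.transform(X_te), y_te)
print(f"held-out accuracy: {acc:.2f}")
```

Swapping the synthetic `X` for embeddings extracted with `output_hidden_states=True` from a Hugging Face Wav2Vec2 model, indexed at an early layer, would reproduce the general shape of the analysis.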