Layer-Wise Analysis of Self-Supervised Representations for Age and Gender Classification in Children's Speech

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Age and gender classification of children’s speech is challenging due to high inter- and intra-speaker variability in pitch, articulation, and vocal tract development; however, the representational mechanisms underlying self-supervised speech models (e.g., Wav2Vec2) for speaker attributes remain poorly understood. This work systematically analyzes layer-wise speaker characteristic encoding in Wav2Vec2 variants, revealing that early layers capture acoustic individuality cues—such as age- and gender-correlated phonetic and prosodic features—while deeper layers encode linguistic content. Leveraging this insight, we propose a hierarchical feature extraction strategy tailored to children’s speech. Evaluated on PFSTAR and CMU Kids datasets, Wav2Vec2-large-lv60 with PCA-based dimensionality reduction achieves 97.14% accuracy for age classification and 98.20% for gender classification on CMU Kids. This study provides the first empirical characterization of speaker-related feature distribution across SSL model layers, establishing an adaptable modeling paradigm for child-centered perceptual interfaces.

📝 Abstract
Children's speech presents challenges for age and gender classification due to high variability in pitch, articulation, and developmental traits. While self-supervised learning (SSL) models perform well on adult speech tasks, their ability to encode speaker traits in children remains underexplored. This paper presents a detailed layer-wise analysis of four Wav2Vec2 variants using the PFSTAR and CMU Kids datasets. Results show that early layers (1-7) capture speaker-specific cues more effectively than deeper layers, which increasingly focus on linguistic information. Applying PCA further improves classification, reducing redundancy and highlighting the most informative components. The Wav2Vec2-large-lv60 model achieves 97.14% (age) and 98.20% (gender) on CMU Kids; base-100h and large-lv60 models reach 86.05% and 95.00% on PFSTAR. These results reveal how speaker traits are structured across SSL model depth and support more targeted, adaptive strategies for child-aware speech interfaces.
Problem

Research questions and friction points this paper is trying to address.

Analyzing SSL models for age and gender classification in children's speech
Exploring layer-wise effectiveness in capturing speaker-specific traits
Improving classification accuracy using PCA and adaptive strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-wise analysis of Wav2Vec2 variants
Early layers capture speaker-specific cues
PCA improves classification by reducing redundancy
Abhijit Sinha
Research Scholar, NIT Sikkim
Speech Processing · Children's Speech Recognition
Harishankar Kumar
Department of Electronics and Communication Engineering, NIT Sikkim, India
Mohit Joshi
Department of Electronics and Communication Engineering, NIT Sikkim, India
Hemant Kumar Kathania
Assistant Professor, NIT Sikkim
Children's Speech Recognition · Low Resource · Zero-shot · Keyword Spotting · Machine Learning
Shrikanth Narayanan
Signal Analysis and Interpretation Lab (SAIL), University of Southern California, USA
Sudarsana Reddy Kadiri
University of Southern California
Speech Processing · Biomedical Signals · Multimodality · Healthcare Informatics · Deep Learning