🤖 AI Summary
Age and gender classification of children's speech is challenging due to high inter- and intra-speaker variability in pitch, articulation, and vocal tract development, yet the representational mechanisms by which self-supervised speech models (e.g., Wav2Vec2) encode speaker attributes remain poorly understood. This work systematically analyzes layer-wise encoding of speaker characteristics in Wav2Vec2 variants, revealing that early layers capture cues of acoustic individuality, such as age- and gender-correlated phonetic and prosodic features, while deeper layers increasingly encode linguistic content. Leveraging this insight, we propose a hierarchical feature extraction strategy tailored to children's speech. Evaluated on the PFSTAR and CMU Kids datasets, Wav2Vec2-large-lv60 with PCA-based dimensionality reduction achieves 97.14% accuracy for age classification and 98.20% for gender classification on CMU Kids. This study provides the first empirical characterization of how speaker-related features are distributed across SSL model layers, establishing an adaptable modeling paradigm for child-centered perceptual interfaces.
📝 Abstract
Children's speech presents challenges for age and gender classification due to high variability in pitch, articulation, and developmental traits. While self-supervised learning (SSL) models perform well on adult speech tasks, their ability to encode speaker traits in children remains underexplored. This paper presents a detailed layer-wise analysis of four Wav2Vec2 variants using the PFSTAR and CMU Kids datasets. Results show that early layers (1–7) capture speaker-specific cues more effectively than deeper layers, which increasingly focus on linguistic information. Applying PCA further improves classification by reducing redundancy and highlighting the most informative components. The Wav2Vec2-large-lv60 model achieves 97.14% (age) and 98.20% (gender) on CMU Kids; the base-100h and large-lv60 models reach 86.05% and 95.00%, respectively, on PFSTAR. These results reveal how speaker traits are structured across SSL model depth and support more targeted, adaptive strategies for child-aware speech interfaces.
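The pipeline described above (layer-wise embeddings → PCA → classifier) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random matrix `X` stands in for utterance-level embeddings that would in practice be mean-pooled hidden states from an early Wav2Vec2 layer (dimension 768 for base, 1024 for large variants), and the labels and injected class offset are synthetic.

```python
# Sketch of a PCA-then-classify pipeline on placeholder layer-wise features.
# In the real setting, X would hold mean-pooled hidden states from an early
# Wav2Vec2 transformer layer (one row per utterance); here it is synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_utts, dim = 400, 768                      # hypothetical corpus size / base dim
labels = rng.integers(0, 2, size=n_utts)    # synthetic binary (gender) labels
X = rng.normal(size=(n_utts, dim))
X[labels == 1, :10] += 1.5                  # inject a separable speaker cue

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)

# PCA reduces redundancy in the high-dimensional embeddings, keeping the
# most informative components before the downstream classifier.
pca = PCA(n_components=64).fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(pca.transform(X_tr), y_tr)
acc = clf.score(pca.transform(X_te), y_te)
print(f"held-out accuracy: {acc:.2f}")
```

Swapping the synthetic `X` for embeddings extracted with `output_hidden_states=True` from a Hugging Face Wav2Vec2 model, indexed at an early layer, would reproduce the general shape of the analysis.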