Context-aware child-directed speech detection from long-form recordings

📅 2026-05-31

🤖 AI Summary

This study addresses the challenge of automatically distinguishing child-directed speech (CDS) from adult-directed speech (ADS) in long-form, real-world recordings. Six self-supervised speech models are fine-tuned and evaluated on a multilingual dataset of 182 children, and in-domain pre-training on child-centered recordings substantially outperforms models trained on adult speech. Incorporating surrounding context yields an absolute gain of 13.8% in average F1 score. In an end-to-end pipeline that runs from adult speech detection to addressee classification, performance drops under automatic segmentation but still consistently outperforms a rule-based baseline. These findings underscore the role of domain-specific pre-training for child-centered speech processing tasks.
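To make the "absolute gain in average F1 score" concrete, the following minimal sketch computes per-class F1 for the two classes and averages them. It assumes that "average" means the unweighted mean over the CDS and ADS classes; the source does not say whether the averaging is over classes, languages, or both. The function names and the toy labels are hypothetical and are not results from the paper.

```python
from typing import Sequence


def f1_score(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> float:
    """F1 for one class, treating `positive` as the target label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(y_true: Sequence[str], y_pred: Sequence[str],
               labels: Sequence[str] = ("CDS", "ADS")) -> float:
    """Unweighted mean of per-class F1 (an assumed definition of 'average F1')."""
    return sum(f1_score(y_true, y_pred, label) for label in labels) / len(labels)


# Toy labels for illustration only; these are not data from the paper.
reference = ["CDS", "CDS", "ADS", "ADS"]
without_context = ["CDS", "ADS", "ADS", "ADS"]
with_context = ["CDS", "CDS", "ADS", "ADS"]

# An absolute gain is a plain difference between two scores, reported in
# percentage points, rather than a ratio relative to the lower score.
absolute_gain = average_f1(reference, with_context) - average_f1(reference, without_context)
```

Under this reading, a 13.8% absolute gain means the average F1 rose by 13.8 percentage points, not by 13.8% of the earlier score.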

📝 Abstract

Automatically distinguishing child-directed speech from adult-directed speech in long-form recordings is key to scalable analyses of children's language environments. Existing approaches process utterances in isolation and have been evaluated primarily on English. We address these gaps along three dimensions. First, we fine-tune and evaluate six self-supervised models on a multilingual dataset of 182 children, showing that in-domain pre-training on child-centered recordings substantially outperforms models trained on adult speech. Second, we demonstrate that incorporating surrounding context substantially improves classification, with an absolute gain of 13.8% in average F1-score. Third, we evaluate our model in a realistic end-to-end pipeline, from adult speech detection to addressee classification, showing that performance drops under automatic segmentation but still consistently outperforms a rule-based baseline.
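One way to picture "incorporating surrounding context" is to gather the utterances that sit close in time to the target utterance before classifying it. The sketch below is an illustration under stated assumptions, not the authors' method: the `Utterance` type, the `context_window` helper, the speaker labels, and the `max_gap_s` and `max_neighbors` parameters are all hypothetical, and the abstract does not describe how context is actually encoded.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    start: float   # onset in seconds
    end: float     # offset in seconds
    speaker: str   # hypothetical speaker label, e.g. "FEM", "MAL", "CHI"


def context_window(utts: List[Utterance], index: int,
                   max_gap_s: float = 30.0,
                   max_neighbors: int = 2) -> List[Utterance]:
    """Return the target utterance with nearby neighbors, in time order.

    `utts` must be sorted by start time. A neighbor is kept only if its gap
    to the target is at most `max_gap_s` seconds (an assumed threshold);
    the search on each side stops at the first neighbor that is too far.
    """
    target = utts[index]

    left: List[Utterance] = []
    for u in reversed(utts[max(0, index - max_neighbors):index]):
        if target.start - u.end > max_gap_s:
            break
        left.append(u)
    left.reverse()  # restore chronological order

    right: List[Utterance] = []
    for u in utts[index + 1:index + 1 + max_neighbors]:
        if u.start - target.end > max_gap_s:
            break
        right.append(u)

    return left + [target] + right


# Toy timeline for illustration only.
timeline = [
    Utterance(0.0, 1.0, "FEM"),
    Utterance(2.0, 3.0, "CHI"),
    Utterance(4.0, 5.0, "FEM"),
    Utterance(100.0, 101.0, "MAL"),
]
window = context_window(timeline, 2)
```

In this toy timeline the window for the third utterance keeps the two earlier turns and drops the distant fourth one. A context-aware classifier would then see the whole window rather than the isolated target utterance.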

Problem

Research questions and friction points this paper is trying to address.

child-directed speech

adult-directed speech

long-form recordings

speech classification

language environment

Innovation

Methods, ideas, or system contributions that make the work stand out.

context-aware modeling

child-directed speech detection

self-supervised learning