Egocentric Speaker Classification in Child-Adult Dyadic Interactions: From Sensing to Computational Modeling

📅 2024-09-14
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This study addresses automatic child versus adult speaker classification in Brief Observation of Social Communication Change (BOSCC) interactions during autism spectrum disorder (ASD) interventions. The authors introduce an egocentric, wearable-audio setup for first-person speech acquisition in clinical settings and integrate egocentric speech modeling into pediatric behavioral analysis: a model pre-trained on large-scale Ego4D speech is adapted via few-shot transfer learning to fine-tune a binary speaker classification network. The pipeline combines on-body audio sensing, egocentric speech feature extraction, and domain-adaptive fine-tuning, and yields substantial improvement in child-adult speaker discrimination accuracy. The results support the feasibility of egocentric audio acquisition and pre-training paradigms for objective, automated assessment of ASD intervention outcomes.
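The transfer-learning step described above (a small binary child/adult classification head fine-tuned on top of embeddings from a frozen pre-trained encoder) can be sketched minimally as follows. This is an illustration only, not the authors' implementation: the `encode` function is a hypothetical stand-in for an Ego4D-pretrained speech encoder, and the labels and embedding size are simulated.

```python
import numpy as np

# Hypothetical stand-in for a frozen speech encoder (e.g. one pre-trained
# on Ego4D audio). Real systems map waveforms to embeddings; here each
# "utterance" is just an integer seed producing a fixed random embedding.
def encode(utterances, dim=16):
    return np.stack(
        [np.random.default_rng(u).normal(size=dim) for u in utterances]
    )

# Few-shot labeled data: 0 = child, 1 = adult (simulated). We shift the
# adult embeddings so the two classes are separable, mimicking the
# structure a good pre-trained encoder would expose.
X = encode(range(40))
y = (np.arange(40) % 2).astype(float)
X[y == 1] += 1.0

# Binary classification head (logistic regression) trained on top of the
# frozen embeddings: the "fine-tune a small head" part of transfer learning.
w = np.zeros(X.shape[1])
b = 0.0
lr = 0.5
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
    w -= lr * (X.T @ (p - y)) / len(y)      # gradient of cross-entropy loss
    b -= lr * np.mean(p - y)

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
accuracy = float(np.mean(preds == y))
```

In the paper's actual pipeline the encoder would be a network pre-trained on egocentric audio, and the head would be trained on the few labeled BOSCC utterances; the sketch only shows the shape of the adaptation step.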

📝 Abstract
Autism spectrum disorder (ASD) is a neurodevelopmental condition characterized by challenges in social communication, repetitive behavior, and sensory processing. One important research area in ASD is evaluating children's behavioral changes over time during treatment. The standard protocol for this objective is the Brief Observation of Social Communication Change (BOSCC), which involves dyadic interactions between a child and clinicians performing a pre-defined set of activities. A fundamental aspect of understanding children's behavior in these interactions is automatic speech understanding, particularly identifying who speaks and when. Conventional approaches in this area rely heavily on speech samples recorded from a spectator perspective, and there is limited research on egocentric speech modeling. In this study, we design an experiment to perform speech sampling in BOSCC interviews from an egocentric perspective using wearable sensors, and explore pre-training on Ego4D speech samples to enhance child-adult speaker classification in dyadic interactions. Our findings highlight the potential of egocentric speech collection and pre-training to improve speaker classification accuracy.
Problem

Research questions and friction points this paper is trying to address.

Automatic speech understanding in child-adult dyadic interactions
Egocentric speech modeling for speaker classification
Improving accuracy in child-adult speaker classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Wearable sensors for egocentric speech sampling
Pre-training Ego4D speech samples
Enhancing child-adult speaker classification accuracy
Tiantian Feng
Postdoc Researcher
Health and Behaviors · Wearable Computing · Affective Computing · Speech and Biosignal · Responsible ML
Anfeng Xu
University of Southern California
Speech Processing · Multimodal AI · LLM · Deep Learning
Xuan Shi
University of Southern California, Los Angeles, USA
Somer Bishop
Department of Psychiatry, University of California, San Francisco, USA
Shrikanth Narayanan
University of Southern California, Los Angeles, USA