Exploring Spatiotemporal Emotional Synchrony in Dyadic Interactions: The Role of Speech Conditions in Facial and Vocal Affective Alignment

📅 2025-04-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the spatiotemporal emotional synchronization between facial and vocal modalities during dyadic conversations, specifically along arousal and valence dimensions, with a focus on how speech overlap versus non-overlap conditions modulate cross-modal alignment. We employ EmoNet for continuous facial affect estimation and fine-tuned Wav2Vec2 for speech-based affect modeling, quantifying synchronization via Pearson correlation, lag analysis, and dynamic time warping (DTW). Our key findings reveal that speech overlap critically regulates synchronization stability and temporal lag patterns: non-overlapping segments exhibit higher synchronization stability (reduced arousal variability; narrower lag distributions), whereas overlapping segments—despite lower DTW distances—show flattened, highly uncertain lag profiles. Furthermore, we identify a structural shift in modality dominance: facial expressions lead turn-taking speech, while vocal signals lead overlapping speech. These results establish a novel “dialogue-structure-driven emotional coordination” paradigm.
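The lag analysis described above can be illustrated with a minimal sketch. Assuming `facial` and `vocal` are equally sampled 1-D arousal (or valence) time series already produced by EmoNet and the Wav2Vec2-based model, the snippet below computes Pearson correlations at every integer lag and picks the best-aligning lag; the variable names, lag window, and sign convention are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of lag-adjusted synchrony analysis between two affect series.
import numpy as np
from scipy.stats import pearsonr

def lagged_correlations(facial, vocal, max_lag: int) -> dict:
    """Pearson r between modalities at each integer lag.

    Convention (an assumption): positive lag means the facial
    signal leads the vocal signal by `lag` frames.
    """
    n = min(len(facial), len(vocal))
    facial, vocal = np.asarray(facial[:n]), np.asarray(vocal[:n])
    corrs = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            f, v = facial[:n - lag], vocal[lag:]   # facial[t] vs vocal[t+lag]
        else:
            f, v = facial[-lag:], vocal[:n + lag]  # facial[t-lag] vs vocal[t]
        if len(f) > 2:  # pearsonr needs enough points to be meaningful
            corrs[lag] = pearsonr(f, v)[0]
    return corrs

def best_lag(facial, vocal, max_lag: int) -> int:
    """Lag with the strongest absolute correlation (one 'best lag' sample)."""
    corrs = lagged_correlations(facial, vocal, max_lag)
    return max(corrs, key=lambda k: abs(corrs[k]))
```

Aggregating `best_lag` over many segments yields the lag distributions the summary refers to: narrow distributions in non-overlapping speech, flatter ones under overlap.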

📝 Abstract
Understanding how humans express and synchronize emotions across multiple communication channels, particularly facial expressions and speech, has significant implications for emotion recognition systems and human-computer interaction. Motivated by the notion that non-overlapping speech promotes clearer emotional coordination while overlapping speech disrupts synchrony, this study examines how these conversational dynamics shape the spatial and temporal alignment of arousal and valence across facial and vocal modalities. Using dyadic interactions from the IEMOCAP dataset, we extracted continuous emotion estimates via EmoNet (facial video) and a Wav2Vec2-based model (speech audio). Segments were categorized based on speech overlap, and emotional alignment was assessed using Pearson correlation, lag-adjusted analysis, and Dynamic Time Warping (DTW). Across analyses, non-overlapping speech was associated with more stable and predictable emotional synchrony than overlapping speech. While zero-lag correlations were low and not statistically different, non-overlapping speech showed reduced variability, especially for arousal. Lag-adjusted correlations and best-lag distributions revealed clearer, more consistent temporal alignment in these segments. In contrast, overlapping speech exhibited higher variability and flatter lag profiles, though DTW indicated unexpectedly tighter alignment, suggesting distinct coordination strategies. Notably, directionality patterns showed that facial expressions more often preceded speech during turn-taking, while speech led during simultaneous vocalizations. These findings underscore the importance of conversational structure in regulating emotional communication and provide new insight into the spatial and temporal dynamics of multimodal affective alignment in real-world interaction.
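The abstract's third alignment measure, DTW, tolerates local time warping that fixed-lag correlation cannot, which is why it can report tighter alignment for overlapping speech even when lag profiles are flat. A compact pure-NumPy sketch under standard DTW assumptions (absolute-difference local cost; the path-length normalization is my assumption, not from the paper):

```python
# Minimal DTW sketch for comparing two per-frame arousal/valence series.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic DTW with |a_i - b_j| local cost; lower = tighter alignment."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of insertion, deletion, and match moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Normalize by a path-length bound so segments of different
    # durations stay comparable (an assumed normalization).
    return D[n, m] / (n + m)
```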
Problem

Research questions and friction points this paper is trying to address.

Examining emotional synchrony in dyadic interactions across facial and vocal modalities
Investigating how speech overlap affects arousal and valence alignment in conversations
Analyzing spatial-temporal dynamics of multimodal affective coordination in real-world interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous emotion extraction with EmoNet (facial video) and a fine-tuned Wav2Vec2 model (speech audio)
Synchrony quantified via Pearson correlation, lag analysis, and Dynamic Time Warping (DTW)
Comparison of emotional synchrony under overlapping vs. non-overlapping speech (see the sketch after this list)
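The overlap categorization itself reduces to interval intersection over each speaker's voice-activity spans. A hypothetical sketch, assuming per-speaker `(start_sec, end_sec)` intervals are available (the interval format and the helper names are illustrative):

```python
# Hypothetical sketch of the speech-overlap categorization step.
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)

def overlap_segments(spk_a: List[Interval],
                     spk_b: List[Interval]) -> List[Interval]:
    """Return the time intervals where both speakers are active at once."""
    overlaps = []
    for a_start, a_end in spk_a:
        for b_start, b_end in spk_b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if end > start:  # non-empty intersection => overlapping speech
                overlaps.append((start, end))
    return sorted(overlaps)

# Example: speaker A talks over the end of speaker B's turn.
print(overlap_segments([(2.0, 4.0)], [(0.0, 2.5)]))  # [(2.0, 2.5)]
```

Everything outside the returned intervals would be treated as non-overlapping (turn-taking) speech in the comparisons above.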
V. R. D. M. Herbuela
International Research Center for Neurointelligence, The University of Tokyo, Tokyo, Japan
Yukie Nagai
The University of Tokyo
cognitive developmental robotics · computational neuroscience · human-robot interaction