Face-Voice Association for Audiovisual Active Speaker Detection in Egocentric Recordings

📅 2025-06-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the failure of conventional audio-visual synchronization modeling in egocentric videos—caused by occlusions, motion blur, and acoustic interference—this paper proposes SL-ASD, a framework that abandons strict temporal alignment in favor of modeling biometric correlations between faces and speech across modalities. Methodologically, SL-ASD introduces a visual-quality-aware dynamic-weighted Transformer encoder, integrated with an utterance-level speech segmentation frontend, to achieve robust speaker activity detection. Unlike state-of-the-art approaches relying on fine-grained temporal synchronization, SL-ASD reduces model parameters by approximately 60% while attaining comparable or superior performance on challenging benchmarks such as EPIC-Kitchens. These results empirically validate the core hypothesis that semantic cross-modal association is more effective and generalizable than precise temporal synchronization for speaker activity detection in unconstrained egocentric settings.

Technology Category

Application Category

📝 Abstract
Audiovisual active speaker detection (ASD) is conventionally performed by modelling the temporal synchronisation of acoustic and visual speech cues. In egocentric recordings, however, the efficacy of synchronisation-based methods is compromised by occlusions, motion blur, and adverse acoustic conditions. In this work, a novel framework is proposed that exclusively leverages cross-modal face-voice associations to determine speaker activity. An existing face-voice association model is integrated with a transformer-based encoder that aggregates facial identity information by dynamically weighting each frame based on its visual quality. This system is then coupled with a front-end utterance segmentation method, producing a complete ASD system. This work demonstrates that the proposed system, Self-Lifting for audiovisual active speaker detection(SL-ASD), achieves performance comparable to, and in certain cases exceeding, that of parameter-intensive synchronisation-based approaches with significantly fewer learnable parameters, thereby validating the feasibility of substituting strict audiovisual synchronisation modelling with flexible biometric associations in challenging egocentric scenarios.
Problem

Research questions and friction points this paper is trying to address.

Detect active speakers in egocentric videos using face-voice associations
Overcome occlusion and noise limitations in synchronisation-based methods
Achieve high performance with fewer parameters than traditional approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages cross-modal face-voice associations
Uses transformer-based encoder for facial identity
Integrates front-end utterance segmentation method
🔎 Similar Papers
No similar papers found.