🤖 AI Summary
Conventional audio-visual synchronization modeling fails in egocentric videos because of occlusions, motion blur, and acoustic interference. To address this, the paper proposes SL-ASD, a framework that abandons strict temporal alignment in favor of modeling cross-modal biometric associations between faces and voices. Methodologically, SL-ASD introduces a visual-quality-aware, dynamically weighted Transformer encoder, integrated with an utterance-level speech segmentation front end, to achieve robust active speaker detection. Unlike state-of-the-art approaches that rely on fine-grained temporal synchronization, SL-ASD reduces model parameters by approximately 60% while attaining comparable or superior performance on challenging benchmarks such as EPIC-Kitchens. These results empirically support the core hypothesis that semantic cross-modal association is more effective and generalizable than precise temporal synchronization for active speaker detection in unconstrained egocentric settings.
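A minimal sketch of how such quality-aware frame aggregation could look in PyTorch is shown below. The class name `QualityWeightedFaceEncoder`, the embedding dimension, and the choice to predict a per-frame quality score from the frame embedding itself are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QualityWeightedFaceEncoder(nn.Module):
    """Sketch of a quality-aware Transformer aggregator: per-frame face
    embeddings are contextualised, scored for visual quality, and pooled
    into a single identity vector (names and sizes are illustrative)."""

    def __init__(self, embed_dim: int = 512, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Assumed: a scalar quality score is predicted from each frame's embedding;
        # the paper may derive quality cues differently (e.g. blur or occlusion measures).
        self.quality_head = nn.Linear(embed_dim, 1)

    def forward(self, face_frames: torch.Tensor) -> torch.Tensor:
        # face_frames: (batch, num_frames, embed_dim) per-frame face embeddings
        ctx = self.encoder(face_frames)                          # contextualised frame features
        weights = torch.softmax(self.quality_head(ctx), dim=1)   # per-frame quality weights
        identity = (weights * ctx).sum(dim=1)                    # quality-weighted pooling
        return F.normalize(identity, dim=-1)                     # unit-norm track identity
```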
📝 Abstract
Audiovisual active speaker detection (ASD) is conventionally performed by modelling the temporal synchronisation of acoustic and visual speech cues. In egocentric recordings, however, the efficacy of synchronisation-based methods is compromised by occlusions, motion blur, and adverse acoustic conditions. In this work, a novel framework is proposed that exclusively leverages cross-modal face-voice associations to determine speaker activity. An existing face-voice association model is integrated with a transformer-based encoder that aggregates facial identity information by dynamically weighting each frame based on its visual quality. This encoder is then coupled with a front-end utterance segmentation method to produce a complete ASD system. This work demonstrates that the proposed system, Self-Lifting for audiovisual active speaker detection (SL-ASD), achieves performance comparable to, and in certain cases exceeding, that of parameter-intensive synchronisation-based approaches with significantly fewer learnable parameters, thereby validating the feasibility of substituting strict audiovisual synchronisation modelling with flexible biometric associations in challenging egocentric scenarios.
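Under this association-based formulation, the final decision reduces to comparing a face-track identity embedding against a voice embedding extracted from each segmented utterance, rather than scoring frame-level synchronisation. The sketch below assumes cosine similarity and a fixed decision threshold; the function name and threshold value are hypothetical.

```python
import torch
import torch.nn.functional as F


def active_speaker_decisions(face_identity: torch.Tensor,
                             voice_embeddings: torch.Tensor,
                             threshold: float = 0.5) -> torch.Tensor:
    """Sketch: label a face track as the active speaker for each segmented
    utterance when its identity embedding is sufficiently similar to the
    utterance's voice embedding (threshold and names are assumptions)."""
    # face_identity:    (embed_dim,)                 aggregated face-track embedding
    # voice_embeddings: (num_utterances, embed_dim)  one embedding per detected utterance
    sims = F.cosine_similarity(voice_embeddings, face_identity.unsqueeze(0), dim=-1)
    return (sims > threshold).long()  # 1 = active speaker for that utterance, 0 = not
```

In this view, the per-utterance decision is a single embedding comparison rather than a dense synchronisation network, which is consistent with the reduction in learnable parameters reported above.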