Speaker Embedding Informed Audiovisual Active Speaker Detection for Egocentric Recordings

📅 2025-02-09
🤖 AI Summary
This paper addresses active speaker detection (ASD) in egocentric (first-person) videos, where motion blur and visual noise degrade the reliability of visual cues. To mitigate this visual uncertainty, the authors propose a multimodal fusion approach that incorporates fine-grained speaker embeddings, extracted from candidate speakers' reference speech, to make voice activity discrimination more robust. They introduce the Speaker Comparison Auxiliary Network (SCAN), which contrasts the candidate audio signal against enrolled speaker embeddings, and design a self-supervised face-speaker alignment framework to improve identity-visual correspondence in egocentric settings. The method combines ECAPA-TDNN speaker embedding extraction, temporal audiovisual synchronization modeling, and self-supervised facial representation learning. On the Ego4D benchmark, adding SCAN yields relative mAP improvements of 14.5% and 10.3% over the TalkNet and Light-ASD baselines, respectively, substantially improving speaker localization in dynamic, noisy environments.
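
To make the embedding step concrete, the sketch below extracts and compares ECAPA-TDNN speaker embeddings using the pretrained SpeechBrain encoder. This is a minimal illustration, not the paper's pipeline: the model source and the two file names are assumptions, and SCAN's actual scoring head is not reproduced here.

```python
import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier

# Pretrained ECAPA-TDNN speaker encoder (assumed stand-in for the paper's
# embedding extractor).
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb"
)

def speaker_embedding(wav_path: str) -> torch.Tensor:
    """Return an L2-normalised ECAPA-TDNN embedding for one utterance."""
    signal, sr = torchaudio.load(wav_path)
    if sr != 16000:  # the pretrained encoder expects 16 kHz audio
        signal = torchaudio.functional.resample(signal, sr, 16000)
    emb = encoder.encode_batch(signal).squeeze()  # shape: (192,)
    return torch.nn.functional.normalize(emb, dim=-1)

# Cosine similarity between enrolled reference speech and the current audio
# window; a high score suggests the candidate speaker is speaking.
ref = speaker_embedding("reference_speech.wav")    # hypothetical file
cand = speaker_embedding("candidate_segment.wav")  # hypothetical file
print(f"speaker-match score: {torch.dot(ref, cand).item():.3f}")
```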

📝 Abstract
Audiovisual active speaker detection (ASD) addresses the task of determining the speech activity of a candidate speaker given acoustic and visual data. Typically, systems model the temporal correspondence of audiovisual cues, such as the synchronisation between speech and lip movement. Recent work has explored extending this paradigm by additionally leveraging speaker embeddings extracted from candidate speaker reference speech. This paper proposes the speaker comparison auxiliary network (SCAN), which uses speaker-specific information from both reference speech and the candidate audio signal to disambiguate challenging scenes when the visual signal is unresolvable. Furthermore, an improved method for enrolling face-speaker libraries is developed, which implements a self-supervised approach to video-based face recognition. Fitting with the recent proliferation of wearable devices, this work focuses on improving speaker-embedding-informed ASD in the context of egocentric recordings, which can be characterised by acoustic noise and highly dynamic scenes. SCAN is implemented with two well-established baselines, namely TalkNet and Light-ASD, yielding relative improvements in mAP of 14.5% and 10.3% on the Ego4D benchmark, respectively.
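
The abstract does not specify SCAN's internals, so the following is only a hedged sketch of one plausible fusion: a small head that combines a per-frame logit from a backbone such as TalkNet or Light-ASD with the audio-versus-reference speaker similarity. All names here are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpeakerComparisonHead(nn.Module):
    """Illustrative fusion of a backbone ASD logit with a speaker similarity.

    av_logit: per-frame logit from an AV backbone (e.g. TalkNet), shape (T,).
    sim: cosine similarity between frame-level audio embeddings and the
         candidate's reference embedding, shape (T,).
    """

    def __init__(self, hidden: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, av_logit: torch.Tensor, sim: torch.Tensor) -> torch.Tensor:
        x = torch.stack([av_logit, sim], dim=-1)  # (T, 2)
        return self.mlp(x).squeeze(-1)            # fused per-frame logit, (T,)

# Usage with 25 frames of random stand-in scores.
head = SpeakerComparisonHead()
print(head(torch.randn(25), torch.rand(25)).shape)  # torch.Size([25])
```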
Problem

Research questions and friction points this paper addresses.

Improve the accuracy of audiovisual active speaker detection in egocentric recordings
Leverage speaker embeddings from reference speech to disambiguate visually unresolvable scenes
Enhance face-speaker library enrollment methods (see the enrollment sketch after this list)
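
As noted above, here is an illustrative sketch of enrolling a face-speaker library: each visual identity is paired with a prototype speaker embedding averaged over its reference utterances. The function and identity labels are hypothetical; the paper's self-supervised enrollment method is more involved.

```python
from collections import defaultdict
import numpy as np

def enroll_library(face_ids, audio_embeddings):
    """Build one prototype speaker embedding per enrolled face identity.

    face_ids: identity label per enrolled utterance, e.g. a face-track ID.
    audio_embeddings: matching list of (D,) speaker embeddings (e.g. ECAPA).
    """
    buckets = defaultdict(list)
    for fid, emb in zip(face_ids, audio_embeddings):
        buckets[fid].append(emb)
    library = {}
    for fid, embs in buckets.items():
        proto = np.mean(embs, axis=0)                 # average per identity
        library[fid] = proto / np.linalg.norm(proto)  # L2-normalise
    return library

# Usage: two utterances enrolled for identity "A", one for "B".
lib = enroll_library(["A", "A", "B"], [np.random.randn(192) for _ in range(3)])
print(sorted(lib))  # ['A', 'B']
```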
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speaker Embedding Integration (via the SCAN auxiliary network)
Self-Supervised Face Recognition (see the alignment sketch after this list)
Egocentric Recording Optimization
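
For the self-supervised face-speaker alignment, the paper's exact objective is not given in this summary; the sketch below shows one standard InfoNCE-style contrastive loss that pulls matched face and speaker embeddings together, as an assumed stand-in rather than the authors' formulation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(face_emb: torch.Tensor, spk_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Contrastive alignment: row i of each batch is the same person."""
    f = F.normalize(face_emb, dim=-1)
    s = F.normalize(spk_emb, dim=-1)
    logits = f @ s.t() / temperature       # (N, N) similarity matrix
    targets = torch.arange(f.size(0))      # positives on the diagonal
    # Symmetric cross-entropy: face-to-speaker and speaker-to-face.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Usage with a batch of 8 paired 192-dim embeddings.
print(alignment_loss(torch.randn(8, 192), torch.randn(8, 192)).item())
```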