UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios

📅 2025-05-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
State-of-the-art active speaker detection (ASD) models score near saturation on established benchmarks like AVA, whose reliance on old movies creates a substantial domain gap, yet they generalize poorly to real-world scenarios characterized by multilingual speech, heavy acoustic noise, and overlapping speakers. Contribution: UniTalk, a real-world-oriented universal ASD benchmark comprising 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, explicitly built around these practical challenges: underrepresented languages, noisy backgrounds, and crowded scenes with concurrent or overlapping speakers. Results: SOTA models suffer substantial performance degradation on UniTalk, indicating the ASD task is far from solved under realistic conditions. Conversely, models trained on UniTalk generalize better across diverse benchmarks, including Talkies, ASW, and AVA, establishing a more rigorous and realistic evaluation standard for ASD.

📝 Abstract
We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes, such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern "in-the-wild" datasets like Talkies and ASW, as well as to AVA. UniTalk thus establishes a new benchmark for active speaker detection, providing researchers with a valuable resource for developing and evaluating versatile and resilient models. Dataset: https://huggingface.co/datasets/plnguyen2908/UniTalk-ASD Code: https://github.com/plnguyen2908/UniTalk-ASD-code
Problem

Research questions and friction points this paper is trying to address.

Addresses domain gaps in active speaker detection datasets
Focuses on diverse real-world conditions like noisy backgrounds
Improves model generalization for underrepresented languages and crowded scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diverse real-world dataset for speaker detection
Frame-level annotations across 48,693 identities
Enhanced generalization with challenging scenarios
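Frame-level ASD benchmarks such as AVA are conventionally scored with mean average precision (mAP) over per-frame speaking/not-speaking predictions. As a minimal pure-Python sketch of the underlying per-track average precision (the function name `frame_ap` and the toy labels/scores below are illustrative, not taken from the paper):

```python
def frame_ap(labels, scores):
    """Average precision for one face track: frames are ranked by model
    confidence, and precision is averaged at each true-positive rank
    (the standard detection-style AP underlying AVA-style mAP)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precisions.append(true_positives / rank)
    # No positive frames in the ground truth -> AP defined as 0 here
    return sum(precisions) / len(precisions) if precisions else 0.0

# Toy example: 4 frames, two of them actually speaking
labels = [1, 0, 1, 0]            # per-frame ground truth (hypothetical)
scores = [0.9, 0.8, 0.4, 0.2]    # per-frame model confidences (hypothetical)
print(round(frame_ap(labels, scores), 4))  # → 0.8333
```

Benchmark-level mAP would then average this quantity over all annotated tracks; a model that ranks every speaking frame above every silent one scores 1.0.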
Le Thien Phuc Nguyen
University of Wisconsin–Madison
Computer Vision, Deep Learning, Multimodality
Zhuoran Yu
University of Wisconsin–Madison
Computer Vision, Machine Learning
Khoa Quang Nhat Cao
University of Wisconsin–Madison
Yuwei Guo
University of Wisconsin–Madison
Tu Ho Manh Pham
University of Wisconsin–Madison
Tuan Tai Nguyen
University of Wisconsin–Madison
Toan Ngo Duc Vo
University of Wisconsin–Madison
Lucas Poon
University of Wisconsin–Madison
Soochahn Lee
Kookmin University
Yong Jae Lee
University of Wisconsin–Madison