UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios

📅 2025-05-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
State-of-the-art active speaker detection (ASD) models score near saturation on established benchmarks like AVA, whose reliance on old movies creates a substantial domain gap, yet they generalize poorly to real-world scenarios characterized by multilingual speech, heavy acoustic noise, and overlapping speakers. Contribution: UniTalk, a real-world-oriented universal ASD benchmark comprising 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, explicitly built around these practical challenges: underrepresented languages, noisy backgrounds, and crowded scenes with concurrent or overlapping speakers. Results: SOTA models suffer substantial performance degradation on UniTalk, indicating the ASD task is far from solved under realistic conditions. Conversely, models trained on UniTalk generalize better across diverse benchmarks, including Talkies, ASW, and AVA, establishing a more rigorous and realistic evaluation standard for ASD.

📝 Abstract
We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes, such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern "in-the-wild" datasets like Talkies and ASW, as well as to AVA. UniTalk thus establishes a new benchmark for active speaker detection, providing researchers with a valuable resource for developing and evaluating versatile and resilient models. Dataset: https://huggingface.co/datasets/plnguyen2908/UniTalk-ASD Code: https://github.com/plnguyen2908/UniTalk-ASD-code
Problem

Research questions and friction points this paper is trying to address.

Addresses domain gaps in active speaker detection datasets
Focuses on diverse real-world conditions like noisy backgrounds
Improves model generalization for underrepresented languages and crowded scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diverse real-world dataset for speaker detection
Frame-level annotations across 48,693 identities
Enhanced generalization with challenging scenarios
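Frame-level ASD benchmarks such as AVA are conventionally scored with mean average precision (mAP) over per-frame speaking/not-speaking predictions. As a minimal pure-Python sketch of the underlying per-track average precision (the function name `frame_ap` and the toy labels/scores below are illustrative, not taken from the paper):

```python
def frame_ap(labels, scores):
    """Average precision for one face track: frames are ranked by model
    confidence, and precision is averaged at each true-positive rank
    (the standard detection-style AP underlying AVA-style mAP)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precisions.append(true_positives / rank)
    # No positive frames in the ground truth -> AP defined as 0 here
    return sum(precisions) / len(precisions) if precisions else 0.0

# Toy example: 4 frames, two of them actually speaking
labels = [1, 0, 1, 0]            # per-frame ground truth (hypothetical)
scores = [0.9, 0.8, 0.4, 0.2]    # per-frame model confidences (hypothetical)
print(round(frame_ap(labels, scores), 4))  # → 0.8333
```

Benchmark-level mAP would then average this quantity over all annotated tracks; a model that ranks every speaking frame above every silent one scores 1.0.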
Le Thien Phuc Nguyen
University of Wisconsin–Madison
Computer Vision, Deep Learning, Multimodality
Zhuoran Yu
University of Wisconsin–Madison
Computer Vision, Machine Learning
Khoa Quang Nhat Cao
University of Wisconsin–Madison
Yuwei Guo
University of Wisconsin–Madison
Tu Ho Manh Pham
University of Wisconsin–Madison
Tuan Tai Nguyen
University of Wisconsin–Madison
Toan Ngo Duc Vo
University of Wisconsin–Madison
Lucas Poon
University of Wisconsin–Madison
Soochahn Lee
Kookmin University
Yong Jae Lee
University of Wisconsin–Madison