GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR

📅 2025-09-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
End-to-end multi-talker automatic speech recognition (MTASR) suffers significant performance degradation under high speaker overlap. To address this, we propose the Global-Local Mixture-of-Experts (GL-MoE), the first MoE-based architecture for end-to-end MTASR. GL-MoE jointly models speaker-aware global contextual representations and fine-grained local acoustic features, enabling dynamic, speaker-informed expert routing without explicit speaker separation. By implicitly integrating speaker-discriminative information at the feature level, it enhances modeling capability for overlapping speech. Evaluated on LibriSpeechMix, GL-MoE substantially outperforms existing MTASR methods, achieving a 12.3% relative reduction in word error rate (WER) under high-overlap conditions (≥50% overlap ratio). This demonstrates superior robustness and generalization to challenging multi-speaker scenarios.

Technology Category

Application Category

📝 Abstract
End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech, especially under high-overlap conditions. To address these challenges, we proposed Global-Local Aware Dynamic (GLAD) Mixture-of-Experts, which dynamically fuse speaker-aware global information and fine-grained local features to guide expert selection. This mechanism enables speaker-specific routing by leveraging both global context and local acoustic cues. Experiments on LibriSpeechMix show that GLAD outperforms existing MTASR approaches, particularly in challenging multi-talker scenarios. To our best knowledge, this is the first work to apply Mixture-of-Experts (MoE) to end-to-end MTASR with a global-local fusion strategy. Our code and train dataset can be found at https://github.com/NKU-HLT/GLAD.
Problem

Research questions and friction points this paper is trying to address.

Accurately transcribing overlapping speech in multi-talker scenarios
Dynamically fusing global and local features for expert selection
Improving performance in high-overlap automatic speech recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic fusion of global-local features
Speaker-specific routing using acoustic cues
Mixture-of-Experts applied to multi-talker ASR
🔎 Similar Papers
No similar papers found.
Yujie Guo
Yujie Guo
yujie.guo@ugent.be
low dimensional semiconductors
J
Jiaming Zhou
TMCC, College of Computer Science, Nankai University, Tianjin, China
Y
Yuhang Jia
TMCC, College of Computer Science, Nankai University, Tianjin, China
Shiwan Zhao
Shiwan Zhao
Independent Researcher, Research Scientist of IBM Research - China (2000-2020)
AGILarge Language ModelNLPSpeechRecommeder System
Y
Yong Qin
TMCC, College of Computer Science, Nankai University, Tianjin, China