🤖 AI Summary
End-to-end multi-talker automatic speech recognition (MTASR) suffers significant performance degradation under high speaker overlap. To address this, the authors propose the Global-Local Aware Dynamic (GLAD) Mixture-of-Experts, the first MoE-based architecture for end-to-end MTASR. GLAD jointly models speaker-aware global contextual representations and fine-grained local acoustic features, enabling dynamic, speaker-informed expert routing without explicit speaker separation. By implicitly integrating speaker-discriminative information at the feature level, it strengthens the modeling of overlapping speech. Evaluated on LibriSpeechMix, GLAD substantially outperforms existing MTASR methods, achieving a 12.3% relative reduction in word error rate (WER) under high-overlap conditions (≥50% overlap ratio), demonstrating superior robustness and generalization in challenging multi-speaker scenarios.
📝 Abstract
End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech, especially under high-overlap conditions. To address these challenges, we propose the Global-Local Aware Dynamic (GLAD) Mixture-of-Experts, which dynamically fuses speaker-aware global information with fine-grained local features to guide expert selection. This mechanism enables speaker-specific routing by leveraging both global context and local acoustic cues. Experiments on LibriSpeechMix show that GLAD outperforms existing MTASR approaches, particularly in challenging multi-talker scenarios. To the best of our knowledge, this is the first work to apply Mixture-of-Experts (MoE) to end-to-end MTASR with a global-local fusion strategy. Our code and training data can be found at https://github.com/NKU-HLT/GLAD.
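To make the routing idea concrete, here is a minimal, purely illustrative sketch of a router that fuses an utterance-level (global) summary with per-frame (local) features to produce expert gating weights. All class and variable names (`GlobalLocalRouter`, `w_local`, `w_global`) are assumptions for illustration, not the paper's implementation, which should be consulted at the repository above.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

class GlobalLocalRouter:
    """Toy MoE router: combines a global (mean-pooled) context vector
    with each frame's local features to compute per-frame expert gates.
    Shapes and names are illustrative, not the paper's code."""
    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Two small projection matrices: one for local frames, one for global context.
        self.w_local = [[rng.gauss(0, 0.02) for _ in range(num_experts)]
                        for _ in range(dim)]
        self.w_global = [[rng.gauss(0, 0.02) for _ in range(num_experts)]
                         for _ in range(dim)]
        self.num_experts = num_experts

    def route(self, frames):
        # frames: list of T feature vectors, each of length dim
        dim = len(frames[0])
        # Global context: mean pooling over all frames of the utterance.
        global_ctx = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
        gates = []
        for f in frames:
            # Fused routing logits: local projection + global projection.
            logits = [
                sum(f[d] * self.w_local[d][e] for d in range(dim))
                + sum(global_ctx[d] * self.w_global[d][e] for d in range(dim))
                for e in range(self.num_experts)
            ]
            gates.append(softmax(logits))
        return gates  # T rows of num_experts gating weights, each summing to 1

router = GlobalLocalRouter(dim=8, num_experts=4)
gates = router.route([[1.0] * 8 for _ in range(5)])
```

Because the global term is shared across all frames of an utterance, it can bias every frame's routing toward speaker-consistent experts, while the local term lets the gate react to frame-level acoustic cues such as overlap regions.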