🤖 AI Summary
End-to-end multi-talker automatic speech recognition (MTASR) suffers significant performance degradation under high speaker overlap. To address this, the authors propose the Global-Local Aware Dynamic (GLAD) Mixture-of-Experts, the first MoE-based architecture for end-to-end MTASR. GLAD jointly models speaker-aware global contextual representations and fine-grained local acoustic features, enabling dynamic, speaker-informed expert routing without explicit speaker separation. By implicitly integrating speaker-discriminative information at the feature level, it strengthens the modeling of overlapping speech. Evaluated on LibriSpeechMix, GLAD substantially outperforms existing MTASR methods, achieving a 12.3% relative reduction in word error rate (WER) under high-overlap conditions (≥50% overlap ratio), demonstrating superior robustness and generalization in challenging multi-speaker scenarios.
📝 Abstract
End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech, especially under high-overlap conditions. To address these challenges, we propose the Global-Local Aware Dynamic (GLAD) Mixture-of-Experts, which dynamically fuses speaker-aware global information with fine-grained local features to guide expert selection. This mechanism enables speaker-specific routing by leveraging both global context and local acoustic cues. Experiments on LibriSpeechMix show that GLAD outperforms existing MTASR approaches, particularly in challenging multi-talker scenarios. To the best of our knowledge, this is the first work to apply Mixture-of-Experts (MoE) to end-to-end MTASR with a global-local fusion strategy. Our code and training data can be found at https://github.com/NKU-HLT/GLAD.
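To make the routing idea concrete, here is a minimal, purely illustrative sketch of a router that fuses an utterance-level (global) summary with per-frame (local) features to produce expert gating weights. All class and variable names (`GlobalLocalRouter`, `w_local`, `w_global`) are assumptions for illustration, not the paper's implementation, which should be consulted at the repository above.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

class GlobalLocalRouter:
    """Toy MoE router: combines a global (mean-pooled) context vector
    with each frame's local features to compute per-frame expert gates.
    Shapes and names are illustrative, not the paper's code."""
    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Two small projection matrices: one for local frames, one for global context.
        self.w_local = [[rng.gauss(0, 0.02) for _ in range(num_experts)]
                        for _ in range(dim)]
        self.w_global = [[rng.gauss(0, 0.02) for _ in range(num_experts)]
                         for _ in range(dim)]
        self.num_experts = num_experts

    def route(self, frames):
        # frames: list of T feature vectors, each of length dim
        dim = len(frames[0])
        # Global context: mean pooling over all frames of the utterance.
        global_ctx = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
        gates = []
        for f in frames:
            # Fused routing logits: local projection + global projection.
            logits = [
                sum(f[d] * self.w_local[d][e] for d in range(dim))
                + sum(global_ctx[d] * self.w_global[d][e] for d in range(dim))
                for e in range(self.num_experts)
            ]
            gates.append(softmax(logits))
        return gates  # T rows of num_experts gating weights, each summing to 1

router = GlobalLocalRouter(dim=8, num_experts=4)
gates = router.route([[1.0] * 8 for _ in range(5)])
```

Because the global term is shared across all frames of an utterance, it can bias every frame's routing toward speaker-consistent experts, while the local term lets the gate react to frame-level acoustic cues such as overlap regions.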