🤖 AI Summary
To address task interference caused by hard parameter sharing in speech-to-text multitask learning, this paper proposes a Supervised Mixture-of-Experts (S-MoE) architecture for joint modeling of automatic speech recognition (ASR) and speech translation (ST), supporting mixed-bandwidth inputs. Unlike conventional MoE approaches, S-MoE has no learnable gating mechanism; instead, task-specific guidance tokens route representations directly to dedicated feed-forward experts, decoupling the tasks' representations and avoiding parameter competition. Both the encoder and the decoder integrate independent expert subnetworks, enabling fine-grained task isolation and parallel optimization. On standard multitask benchmarks, S-MoE achieves a 6.35% relative reduction in word error rate (WER), improving the synergy between ASR and ST. This work introduces an efficient, interpretable, and gating-free paradigm for speech multitask modeling.
📝 Abstract
Hard-parameter sharing is a common strategy for training a single model jointly across diverse tasks. However, it often leads to task interference, impeding overall model performance. To address this issue, we propose a simple yet effective Supervised Mixture of Experts (S-MoE). Unlike traditional Mixture of Experts models, S-MoE eliminates the need to train gating functions by using special guiding tokens to route each task to its designated expert. By assigning each task a separate feed-forward network, S-MoE overcomes the limitations of hard-parameter sharing. We further apply S-MoE to a speech-to-text model, enabling it to process mixed-bandwidth input while jointly performing automatic speech recognition (ASR) and speech translation (ST). Experimental results demonstrate the effectiveness of the proposed S-MoE, which achieves a 6.35% relative improvement in Word Error Rate (WER) when applied to both the encoder and decoder.
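The core routing idea above (a fixed task token selects a per-task feed-forward expert, with no learned gating) can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's implementation; the layer class, token names (`<asr>`, `<st>`), and dimensions are all hypothetical.

```python
import numpy as np

class SMoELayer:
    """Sketch of a supervised-MoE feed-forward layer: one expert FFN per task.

    Routing is a fixed task-token -> expert lookup rather than a trained
    gating network, so the two tasks never compete for the same FFN weights.
    (All names and shapes here are illustrative assumptions.)
    """

    def __init__(self, d_model, d_ff, tasks, seed=0):
        rng = np.random.default_rng(seed)
        # One independent (W1, W2) feed-forward expert per task.
        self.experts = {
            t: (rng.standard_normal((d_model, d_ff)) * 0.02,
                rng.standard_normal((d_ff, d_model)) * 0.02)
            for t in tasks
        }

    def forward(self, x, task_token):
        # Select the expert named by the guiding token; no gating
        # probabilities are computed or learned.
        w1, w2 = self.experts[task_token]
        h = np.maximum(x @ w1, 0.0)   # ReLU feed-forward
        return x + h @ w2             # residual connection

# Usage: the same hidden states are routed to different experts per task.
layer = SMoELayer(d_model=8, d_ff=16, tasks=["<asr>", "<st>"])
x = np.random.default_rng(1).standard_normal((4, 8))  # (seq_len, d_model)
y_asr = layer.forward(x, "<asr>")
y_st = layer.forward(x, "<st>")
```

Because each task owns its expert outright, the ASR and ST outputs for the same input differ, which is exactly the task isolation the paper attributes to S-MoE.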