🤖 AI Summary
To address task interference caused by hard parameter sharing in speech-to-text multitask learning, this paper proposes a Supervised Mixture-of-Experts (S-MoE) architecture for joint modeling of automatic speech recognition (ASR) and speech translation (ST), supporting mixed-bandwidth inputs. Unlike conventional MoE approaches, S-MoE has no learnable gating mechanism; instead, task-specific guidance tokens route representations directly to dedicated feed-forward experts, decoupling the tasks' representations and avoiding parameter competition. Both the encoder and the decoder integrate independent expert subnetworks, enabling fine-grained task isolation and parallel optimization. On standard multitask benchmarks, S-MoE achieves a 6.35% relative reduction in word error rate (WER), improving the synergy between ASR and ST. This work introduces an efficient, interpretable, and gating-free paradigm for speech multitask modeling.
📝 Abstract
Hard-parameter sharing is a common strategy for training a single model jointly across diverse tasks. However, it often leads to task interference, impeding overall model performance. To address this issue, we propose a simple yet effective Supervised Mixture of Experts (S-MoE). Unlike traditional Mixture of Experts models, S-MoE eliminates the need to train gating functions by using special guiding tokens to route each task to its designated expert. By assigning each task a separate feed-forward network, S-MoE overcomes the limitations of hard-parameter sharing. We further apply S-MoE to a speech-to-text model, enabling it to process mixed-bandwidth input while jointly performing automatic speech recognition (ASR) and speech translation (ST). Experimental results demonstrate the effectiveness of the proposed S-MoE, which achieves a 6.35% relative improvement in Word Error Rate (WER) when applied to both the encoder and decoder.
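The core routing idea above (a fixed task token selects a per-task feed-forward expert, with no learned gating) can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's implementation; the layer class, token names (`<asr>`, `<st>`), and dimensions are all hypothetical.

```python
import numpy as np

class SMoELayer:
    """Sketch of a supervised-MoE feed-forward layer: one expert FFN per task.

    Routing is a fixed task-token -> expert lookup rather than a trained
    gating network, so the two tasks never compete for the same FFN weights.
    (All names and shapes here are illustrative assumptions.)
    """

    def __init__(self, d_model, d_ff, tasks, seed=0):
        rng = np.random.default_rng(seed)
        # One independent (W1, W2) feed-forward expert per task.
        self.experts = {
            t: (rng.standard_normal((d_model, d_ff)) * 0.02,
                rng.standard_normal((d_ff, d_model)) * 0.02)
            for t in tasks
        }

    def forward(self, x, task_token):
        # Select the expert named by the guiding token; no gating
        # probabilities are computed or learned.
        w1, w2 = self.experts[task_token]
        h = np.maximum(x @ w1, 0.0)   # ReLU feed-forward
        return x + h @ w2             # residual connection

# Usage: the same hidden states are routed to different experts per task.
layer = SMoELayer(d_model=8, d_ff=16, tasks=["<asr>", "<st>"])
x = np.random.default_rng(1).standard_normal((4, 8))  # (seq_len, d_model)
y_asr = layer.forward(x, "<asr>")
y_st = layer.forward(x, "<st>")
```

Because each task owns its expert outright, the ASR and ST outputs for the same input differ, which is exactly the task isolation the paper attributes to S-MoE.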