SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current simultaneous speech-to-text translation (SimulST) systems struggle to jointly optimize translation quality, low latency, and semantic coherence in multilingual many-to-many settings, while heterogeneous read-write policies impede unified policy learning. This paper proposes an unsupervised policy learning framework: it implicitly models read-generate decisions via prefix training and introduces a Mixture-of-Experts (MoE) router as a cross-lingual unified policy executor—requiring no additional inference overhead. Integrated with an MoE-based refinement mechanism, the approach enables streaming speech-to-text and speech-to-speech translation within a Transformer architecture. Evaluated on six language pairs, our 500M-parameter model achieves <7% BLEU degradation at 1.5-second average latency and <3% at 3 seconds—significantly outperforming the Seamless baseline. To our knowledge, this is the first work to jointly optimize high translation quality, low latency, and policy scalability in multilingual SimulST.

📝 Abstract
Simultaneous Speech Translation (SimulST) enables real-time cross-lingual communication by jointly optimizing speech recognition and machine translation under strict latency constraints. Existing systems struggle to balance translation quality, latency, and semantic coherence, particularly in multilingual many-to-many scenarios where divergent read and write policies hinder unified strategy learning. In this paper, we present SimulMEGA (Simultaneous Generation by Mixture-of-Experts Gating), an unsupervised policy learning framework that combines prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions implicitly, without adding inference-time overhead. Our design requires only minimal modifications to standard transformer architectures and generalizes across both speech-to-text and text-to-speech streaming tasks. In a comprehensive evaluation on six language pairs, our 500M-parameter speech-to-text model outperforms the Seamless baseline, achieving under 7 percent BLEU degradation at 1.5 seconds of average lag and under 3 percent at 3 seconds. We further demonstrate the versatility of SimulMEGA by extending it to streaming TTS with a unidirectional backbone, yielding superior latency-quality trade-offs.
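The abstract's prefix-based training can be pictured as pairing random source prefixes with proportional target prefixes, so the model learns to generate from partial input and thereby encodes read/write timing implicitly. A minimal sketch of that data construction, assuming proportional truncation (the paper's exact recipe may differ):

```python
import random

def make_prefix_pair(src_tokens, tgt_tokens, min_frac=0.25):
    """Illustrative prefix-training sample: truncate source and target
    to the same random fraction of their lengths, so the model is
    trained to produce a partial translation from a partial input.
    The proportional-truncation heuristic is an assumption here."""
    frac = random.uniform(min_frac, 1.0)
    src_len = max(1, int(len(src_tokens) * frac))
    tgt_len = max(1, int(len(tgt_tokens) * frac))
    return src_tokens[:src_len], tgt_tokens[:tgt_len]

src, tgt = make_prefix_pair(list(range(10)), list(range(8)))
print(len(src), len(tgt))
```

Training on many such truncated pairs exposes the decoder to every degree of source completeness, which is what lets a single model later decide, per step, whether the prefix seen so far suffices to emit the next token.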
Problem

Research questions and friction points this paper is trying to address.

Balancing translation quality, latency, and semantic coherence in simultaneous speech translation
Addressing divergent read-write policies in multilingual many-to-many translation scenarios
Developing efficient simultaneous translation without adding inference-time overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised policy learning with Mixture-of-Experts
Prefix-based training without inference overhead
Minimal modifications to standard transformer architecture
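The key innovation is that the MoE router's existing gate doubles as the read/write policy, so no extra policy network runs at inference time. A hypothetical sketch of how a router gate could be read as a policy, assuming one expert is reserved as a "wait" expert whose gate probability signals READ (all names, the reserved-expert trick, and the threshold are illustrative assumptions, not the paper's API):

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class MoERouterPolicy:
    """Hypothetical sketch: reuse an MoE router gate as a read/write
    policy. Expert `wait_expert` is reserved; when its gate probability
    exceeds `read_threshold`, the model READs more source audio,
    otherwise it WRITEs the next target token."""

    def __init__(self, d_model, n_experts, wait_expert=0,
                 read_threshold=0.5, seed=0):
        rng = random.Random(seed)
        # Router weights (n_experts x d_model), randomly initialized here.
        self.w = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                  for _ in range(n_experts)]
        self.wait_expert = wait_expert
        self.read_threshold = read_threshold

    def decide(self, state):
        # state: a decoder hidden state, list of d_model floats.
        logits = [sum(wi * si for wi, si in zip(row, state))
                  for row in self.w]
        probs = softmax(logits)
        action = "READ" if probs[self.wait_expert] > self.read_threshold else "WRITE"
        return action, probs

policy = MoERouterPolicy(d_model=8, n_experts=4)
action, probs = policy.decide([0.1] * 8)
print(action)
```

Because the gate is computed anyway for expert routing, reading the policy off it adds no inference-time cost, which matches the "no additional inference overhead" claim above.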
Chenyang Le
Shanghai Jiao Tong University, China
Bing Han
Shanghai Jiao Tong University, China
Jinshun Li
Shanghai Jiao Tong University, China
Songyong Chen
Shanghai Jiao Tong University, China
Yanmin Qian
Professor, Shanghai Jiao Tong University, China
Speech and Language Processing, Signal Processing, Machine Learning