Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice

📅 2025-07-23
🤖 AI Summary
Simultaneous interpretation (SI) faces critical bottlenecks, especially in extended dialogues: inaccurate speech transcription and translation, high latency, multi-speaker confusion, target-language expansion (“translation inflation”), and lack of real-time speech generation. This paper proposes an end-to-end duplex speech understanding-generation framework that jointly integrates automatic speech recognition (ASR), machine translation (MT), text-to-speech synthesis (TTS), and voice cloning. Through large-scale pretraining and reinforcement learning, the framework preserves source-speaker vocal characteristics while responding with ultra-low latency. Key contributions include: (i) voice cloning integrated into end-to-end SI, addressing multi-speaker confusion and translation inflation; (ii) a reduction of nearly 70% in the average latency of cloned speech, from nearly 10 s to about 3 s; and (iii) evaluation by human interpreters showing over 70% correctness in complex scenarios, with significant gains in translation quality over commercial SI systems.

📝 Abstract
Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourse. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that, through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while cutting the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, a reduction of nearly 70% that drastically enhances practical usability.
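The latency figure quoted in the abstract can be checked with simple arithmetic. The snippet below is a minimal sketch, not part of the paper: the function name `relative_reduction` is hypothetical, and the inputs are the rounded values from the abstract ("nearly 10 seconds" and "3 seconds"), so the result is only approximate.

```python
def relative_reduction(before_s: float, after_s: float) -> float:
    """Return the fraction by which latency drops from before_s to after_s."""
    if before_s <= 0:
        raise ValueError("before_s must be positive")
    return (before_s - after_s) / before_s


# Rounded figures quoted in the abstract (assumed inputs, not measured here).
reduction = relative_reduction(10.0, 3.0)
print(f"{reduction:.0%}")  # prints "70%"
```

Because the baseline is stated as "nearly" 10 seconds, the actual reduction is slightly below 70%, which matches the abstract's "nearly 70%" wording.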
Problem

Research questions and friction points this paper is trying to address.

Improving real-time speech-to-speech translation accuracy and latency
Enabling voice cloning in simultaneous interpretation systems
Reducing multi-speaker confusion and speech inflation in translations
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end simultaneous speech-to-speech translation
Duplex speech understanding-generating framework
Ultra-low-latency with voice cloning
Authors

Shanbo Cheng (ByteDance Seed): LLMs, ML, NLP, Machine Translation, Multi modal
Yu Bao (ByteDance Seed)
Zhichao Huang (ByteDance Seed)
Yu Lu (ByteDance Seed)
Ningxin Peng (ByteDance Research)
Lu Xu (Postdoc, Riken AIP): deep learning, machine learning, computer vision
Runsheng Yu (Unknown affiliation)
Rong Cao (ByteDance Seed)
Ting Han (ByteDance Seed)
Zeyang Li (KTH Royal Institute of Technology): channel modeling, wireless sensing, reconfigurable intelligent surface-aided networks, millimeter wa
Sitong Liu (Duke University)
Shengtao Ma (ByteDance Seed)
Shiguang Pan (ByteDance Seed)
Jiongchen Xiao (ByteDance Seed)
Nuo Xu (ByteDance Seed)
Meng Yang (ByteDance Seed)
Rong Ye (ByteDance): NLP, Speech Translation, LLMs, LLM Agent
Yiming Yu (ByteDance Seed)
Ruofei Zhang (ByteDance Seed)
Wanyi Zhang (ByteDance Seed)
Wenhao Zhu (ByteDance Seed): Large Language Model, Machine Translation
Liehao Zou (ByteDance Seed)
Lu Lu (ByteDance Seed)
Yuxuan Wang (ByteDance Seed)
Yonghui Wu (ByteDance Seed)