Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice

📅 2025-07-23

🤖 AI Summary

Simultaneous interpretation (SI) faces critical bottlenecks, especially in extended dialogues: inaccurate speech transcription, high latency, speaker diarization errors, target-language expansion (“translation inflation”), and a lack of real-time speech generation. This paper proposes an end-to-end duplex speech understanding-generation framework that jointly integrates automatic speech recognition (ASR), machine translation (MT), text-to-speech synthesis (TTS), and voice cloning. Leveraging large-scale pretraining and reinforcement learning for joint optimization, the framework preserves source-speaker vocal characteristics while achieving ultra-low-latency response. Key contributions include: (i) the first integration of controllable voice cloning into end-to-end SI, addressing multi-speaker confusion and translation inflation; (ii) a roughly 70% reduction in the average latency of cloned speech, from nearly 10 s to about 3 s; and (iii) evaluation by human interpreters showing more than 70% correctness in complex scenarios, with significant improvements in translation quality over commercial SI systems.

📝 Abstract

Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourse. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that, through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while cutting the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, a reduction of roughly 70% that substantially improves practical usability.
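
The reduction quoted in the abstract can be checked with simple arithmetic. The sketch below is only illustrative: the function name `relative_reduction` is hypothetical, and the inputs are the rounded figures from the abstract ("nearly 10 seconds" and "3 seconds"), not measured values.

```python
def relative_reduction(before_s: float, after_s: float) -> float:
    """Return the fractional reduction when latency drops from before_s to after_s."""
    if before_s <= 0:
        raise ValueError("before_s must be positive")
    return (before_s - after_s) / before_s


# Assumed inputs: the abstract's approximate latencies, in seconds.
reduction = relative_reduction(10.0, 3.0)
print(f"Latency reduction: {reduction:.0%}")
```

Because the abstract gives the starting latency only as "nearly 10 seconds", the true reduction is somewhat below the value computed from the rounded figures.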

Problem

Research questions and friction points this paper is trying to address.

Improving real-time speech-to-speech translation accuracy and latency

Enabling voice cloning in simultaneous interpretation systems

Reducing multi-speaker confusion and speech inflation in translations

Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end simultaneous speech-to-speech translation

Duplex speech understanding-generating framework

Ultra-low latency with voice cloning
