SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis

📅 2024-09-11
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
This work addresses key challenges in zero-shot text-based speech editing and text-to-speech (TTS) synthesis: unstable generation, lack of safeguards against misuse, limited robustness to background sounds, and difficulty with multi-span editing. To this end, we propose SSR-Speech, a neural codec autoregressive model. It is built on a Transformer decoder and uses classifier-free guidance to stabilize generation; a Watermark Encodec embeds frame-level watermarks in the edited regions so that they can be detected and localized; and waveform reconstruction reuses the original unedited speech segments to improve fidelity. Experiments on the RealEdit speech editing task and the LibriTTS text-to-speech task show state-of-the-art performance. The model also supports multi-span editing and is robust to background sounds.
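
To make the idea of frame-level localization concrete, here is a minimal sketch of how per-frame watermark decisions could be turned into time spans of edited audio. It is not the paper's detector: the function name `edited_spans`, the binary `frame_flags` input, and the frame rate are all hypothetical and chosen only for illustration.

```python
def edited_spans(frame_flags, frame_rate_hz):
    """Group consecutive flagged frames into (start_sec, end_sec) spans.

    frame_flags: per-frame 0/1 decisions, assumed to come from some
    watermark detector (hypothetical input, not the paper's API).
    frame_rate_hz: frames per second of the detector output (assumed).
    """
    spans = []
    start = None  # index of the first frame of the span being built
    for i, flag in enumerate(frame_flags):
        if flag and start is None:
            start = i  # a new flagged run begins here
        elif not flag and start is not None:
            spans.append((start / frame_rate_hz, i / frame_rate_hz))
            start = None  # the run ended on the previous frame
    if start is not None:
        # The last run extends to the end of the signal.
        spans.append((start / frame_rate_hz, len(frame_flags) / frame_rate_hz))
    return spans


# Two separate flagged runs, as in a multi-span edit.
flags = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
spans = edited_spans(flags, frame_rate_hz=50)
```

Each returned pair marks where a run of flagged frames starts and ends in seconds, so a multi-span edit shows up as several pairs.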

๐Ÿ“ Abstract
In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis. SSR-Speech is built on a Transformer decoder and incorporates classifier-free guidance to enhance the stability of the generation process. A watermark Encodec is proposed to embed frame-level watermarks into the edited regions of the speech so that the edited parts can be detected. In addition, the waveform reconstruction leverages the original unedited speech segments, providing better recovery than the Encodec model. Our approach achieves state-of-the-art performance on the RealEdit speech editing task and the LibriTTS text-to-speech task, surpassing previous methods. Furthermore, SSR-Speech excels in multi-span speech editing and demonstrates remarkable robustness to background sounds. The source code and demos are released.
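
As a rough illustration of classifier-free guidance in an autoregressive decoder, the sketch below blends conditional and unconditional next-token logits before sampling. It uses one common formulation, `uncond + scale * (cond - uncond)`, which may differ from the exact form in SSR-Speech. The names `cfg_logits` and `scale` and the toy logit values are assumptions, not taken from the released code.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Blend per-token logits with classifier-free guidance.

    cond:   logits computed with the conditioning input (e.g. text) present.
    uncond: logits computed with the conditioning input dropped.
    scale:  guidance strength (assumed name); 1.0 returns `cond` unchanged,
            larger values push further toward the conditional prediction.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy three-token vocabulary; values are illustrative only.
cond = [2.0, 1.0, 0.0]
uncond = [1.0, 1.0, 1.0]
guided = cfg_logits(cond, uncond, scale=1.5)
probs = softmax(guided)
```

With a scale above 1.0, tokens favored by the conditional branch gain probability mass relative to plain conditional sampling. This is the usual intuition for why guidance can stabilize generation.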
Problem

Research questions and friction points this paper is trying to address.

Speech Editing
Text-to-Speech Conversion
Robustness to Interference
Innovation

Methods, ideas, or system contributions that make the work stand out.

SSR-Speech
Transformer Decoder
Watermarking Technique
Helin Wang
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA
Meng Yu
Tencent AI Lab, Bellevue, USA
Jiarui Hai
Johns Hopkins University
computer audition, generative models, music information retrieval
Chen Chen
Nanyang Technological University, Singapore
Yuchen Hu
Nanyang Technological University, Singapore
Rilin Chen
Tencent AI Lab, Beijing, China
N. Dehak
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA
Dong Yu
Tencent AI Lab, Bellevue, USA