CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency and inconsistency challenges of traditional cascaded systems in voice editing, which rely on complex preprocessing and explicit temporal alignment. Building upon the zero-shot text-to-speech model CosyVoice, the authors propose an end-to-end voice editing framework through task-oriented fine-tuning and inference optimization, effectively internalizing speech-text alignment while preserving speaker identity. Trained on a newly curated GigaEdit dataset using only 250 hours of data, the resulting 400M-parameter model is the first to unlock efficient end-to-end editing capabilities from a zero-shot TTS foundation. On the RealEdit benchmark, it surpasses baseline approaches based on billion-parameter language models and matches or exceeds the performance of state-of-the-art cascaded systems.

📝 Abstract
Automatic speech editing aims to modify spoken content based on textual instructions, yet traditional cascaded systems suffer from complex preprocessing pipelines and a reliance on explicit external temporal alignment. Addressing these limitations, we propose CosyEdit, an end-to-end speech editing model adapted from CosyVoice through task-specific fine-tuning and an optimized inference procedure, which internalizes speech-text alignment while ensuring high consistency between the speech before and after editing. By fine-tuning on only 250 hours of supervised data from our curated GigaEdit dataset, our 400M-parameter model achieves reliable speech editing performance. Experiments on the RealEdit benchmark indicate that CosyEdit not only outperforms several billion-parameter language model baselines but also matches the performance of state-of-the-art cascaded approaches. These results demonstrate that, with task-specific fine-tuning and inference optimization, robust and efficient speech editing capabilities can be unlocked from a zero-shot TTS model, yielding a novel and cost-effective end-to-end solution for high-quality speech editing.
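The cascaded-versus-end-to-end distinction in the abstract can be sketched in toy form. The snippet below is purely illustrative and assumes nothing about CosyEdit's actual API: speech is stood in for by one token per word, and all function names (`cascaded_edit`, `end_to_end_edit`) are hypothetical. The point is the interface shape: a cascaded pipeline needs an explicit transcript and word-level alignment to locate and splice the edited span, while an end-to-end model consumes the audio and the target text directly, with alignment handled internally.

```python
# Toy illustration only: audio is modeled as one token per word,
# and the stub "TTS" just wraps a word in tts(...). None of this
# reflects CosyEdit's real interfaces.

def cascaded_edit(audio_tokens, transcript, target_text):
    """Cascaded pipeline: use an explicit transcript and alignment to
    find the changed span, re-synthesize only that span, and splice."""
    words = transcript.split()
    assert len(audio_tokens) == len(words)  # stub alignment: 1 token/word
    target_words = target_text.split()
    # Find the longest common prefix and suffix around the edit.
    start = 0
    while (start < min(len(words), len(target_words))
           and words[start] == target_words[start]):
        start += 1
    end_src, end_tgt = len(words), len(target_words)
    while (end_src > start and end_tgt > start
           and words[end_src - 1] == target_words[end_tgt - 1]):
        end_src -= 1
        end_tgt -= 1
    # Re-synthesize only the changed words and splice them back in.
    new_span = [f"tts({w})" for w in target_words[start:end_tgt]]
    return audio_tokens[:start] + new_span + audio_tokens[end_src:]

def end_to_end_edit(audio_tokens, target_text):
    """End-to-end interface: no transcript or external alignment is
    passed in; here the stub simply regenerates all tokens."""
    return [f"tts({w})" for w in target_text.split()]
```

For example, editing "the cat sat" into "the dog sat" through `cascaded_edit` keeps the original tokens for "the" and "sat" and splices in a newly synthesized token only for "dog", whereas the end-to-end stub needs no transcript argument at all.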
Problem

Research questions and friction points this paper is trying to address.

- speech editing
- text-to-speech
- temporal alignment
- end-to-end
- zero-shot

Innovation

Methods, ideas, or system contributions that make the work stand out.

- end-to-end speech editing
- zero-shot TTS adaptation
- task-specific fine-tuning
- internalized alignment
- efficient speech editing
👥 Authors

- Junyang Chen (Nanjing University of Science and Technology)
- Yuhang Jia (College of Computer Science, Nankai University, Tianjin, China)
- Hui Wang (College of Computer Science, Nankai University, Tianjin, China)
- Jiaming Zhou (Nankai University; Automatic Speech Recognition, Speech processing)
- Yaxin Han (Lingxi Technology, Beijing, China)
- Mengying Feng (Lingxi Technology, Beijing, China)
- Yong Qin (Nankai University; speech technologies, AI)