DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec

📅 2025-05-30
🤖 AI Summary
This work addresses the limited reconstruction fidelity and noise robustness of neural speech codecs used in text-to-speech (TTS) systems. We propose DS-Codec, a TTS-oriented neural speech codec trained in two stages. Its core idea is to switch between a mirror and a non-mirror architecture during training: Stage I uses a mirror-symmetric encoder-decoder to improve codebook stability and noise robustness; Stage II switches to a non-mirror architecture to maximize reconstruction accuracy. With this architecture switch and joint codebook optimization, both guided by ablation studies, DS-Codec is reported to surpass state-of-the-art baselines on key metrics, including Mel Cepstral Distortion (MCD) and Short-Time Objective Intelligibility (STOI), under both clean and noisy conditions. The experiments show improved speech tokenization quality and high-fidelity, low-distortion waveform reconstruction, making DS-Codec a stronger acoustic representation for modern TTS systems.

📝 Abstract
Neural speech codecs are essential for advancing text-to-speech (TTS) systems. With the recent success of large language models in text generation, developing high-quality speech tokenizers has become increasingly important. This paper introduces DS-Codec, a novel neural speech codec with a dual-stage training framework that switches between mirror and non-mirror architectures, designed to achieve superior speech reconstruction. We conduct extensive experiments and ablation studies to evaluate the effectiveness of our training strategy and to compare the performance of the two architectures. Our results show that the mirrored structure significantly enhances the robustness of the learned codebooks, and that the training strategy balances the advantages of the mirrored and non-mirrored structures, leading to improved high-fidelity speech reconstruction.
Problem

Research questions and friction points this paper is trying to address.

Develop high-quality speech tokenizers for TTS systems
Achieve superior speech reconstruction via dual-stage training
Balance robustness and fidelity in neural speech codecs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stage training framework for speech codec
Mirror-to-NonMirror architecture switching strategy
Enhanced robustness and high-fidelity speech reconstruction
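
The switching idea can be illustrated with a small configuration-level sketch. It is not the authors' implementation: it assumes that "mirror" means the decoder reverses the encoder's layer configuration, and that the switch happens at a fixed training step. The names `CodecLayout`, `mirror_layout`, `non_mirror_layout`, and `layout_for_step`, as well as the stride values, are hypothetical. The summary does not say which components are carried over between stages, so the sketch leaves that out.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CodecLayout:
    """Hypothetical description of an encoder-decoder pair by its strides."""
    encoder_strides: Tuple[int, ...]
    decoder_strides: Tuple[int, ...]

    @property
    def is_mirror(self) -> bool:
        # Assumption: a mirror decoder undoes the encoder layer by layer,
        # so its strides are the encoder strides in reverse order.
        return self.decoder_strides == tuple(reversed(self.encoder_strides))


def mirror_layout(encoder_strides: Sequence[int]) -> CodecLayout:
    """Stage I layout: decoder is the reversed encoder."""
    enc = tuple(encoder_strides)
    return CodecLayout(enc, tuple(reversed(enc)))


def non_mirror_layout(encoder_strides: Sequence[int],
                      decoder_strides: Sequence[int]) -> CodecLayout:
    """Stage II layout: decoder is chosen freely, but must restore the length."""
    enc, dec = tuple(encoder_strides), tuple(decoder_strides)
    # Total upsampling must equal total downsampling, otherwise the
    # reconstructed waveform would not match the input length.
    if math.prod(enc) != math.prod(dec):
        raise ValueError("decoder must upsample by the encoder's total factor")
    return CodecLayout(enc, dec)


def layout_for_step(step: int, switch_step: int,
                    stage1: CodecLayout, stage2: CodecLayout) -> CodecLayout:
    """Pick the architecture for a training step (fixed switch point assumed)."""
    return stage1 if step < switch_step else stage2


# Illustrative values only; the paper's actual strides are not given here.
stage1 = mirror_layout((2, 4, 5, 8))
stage2 = non_mirror_layout((2, 4, 5, 8), (5, 4, 4, 4))
```

In this sketch the encoder configuration is identical in both stages and only the decoder changes at `switch_step`; that is one possible reading of the summary, chosen to keep the example small.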
Peijie Chen
School of Informatics, Xiamen University, China
Wenhao Guan
Xiamen University
speech
Kaidi Wang
School of Informatics, Xiamen University, China
Weijie Wu
Roblox
Computer Networks
Hukai Huang
School of Informatics, Xiamen University, China
Q. Hong
School of Informatics, Xiamen University, China
Lin Li
School of Electronic Science and Engineering, Xiamen University, China