🤖 AI Summary
This work addresses the limited reconstruction fidelity and noise robustness of neural speech codecs used in text-to-speech (TTS) systems. We propose DS-Codec, a TTS-oriented neural speech codec trained in two stages. Its core innovation is a training mechanism that switches between mirror and non-mirror architectures: Stage I employs a mirror-symmetric encoder-decoder to enhance codebook stability and noise robustness; Stage II switches to a non-mirror architecture to maximize reconstruction accuracy. This structural switching, combined with joint codebook optimization and ablation-driven design, enables DS-Codec to surpass state-of-the-art baselines on key metrics, including Mel Cepstral Distortion (MCD) and Short-Time Objective Intelligibility (STOI), under both clean and noisy conditions. Experiments demonstrate significantly improved speech tokenization quality, achieving high-fidelity, low-distortion waveform reconstruction. DS-Codec thus establishes a strong acoustic representation foundation for modern TTS systems.
📝 Abstract
Neural speech codecs are essential for advancing text-to-speech (TTS) systems. With the recent success of large language models in text generation, developing high-quality speech tokenizers has become increasingly important. This paper introduces DS-Codec, a novel neural speech codec featuring a dual-stage training framework that switches between mirror and non-mirror architectures, designed to achieve superior speech reconstruction. We conduct extensive experiments and ablation studies to evaluate the effectiveness of our training strategy and to compare the performance of the two architectures. Our results show that the mirrored structure significantly enhances the robustness of the learned codebooks, and that the training strategy balances the advantages of the mirrored and non-mirrored structures, leading to improved high-fidelity speech reconstruction.
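The dual-stage switching described above might be orchestrated along the following lines. This is a minimal, framework-agnostic sketch under stated assumptions: the stage boundary, module names, and the policy of carrying the encoder and codebook into Stage II are illustrative guesses, not the paper's exact recipe.

```python
# Hypothetical schedule for two-stage codec training.
# Stage I:  mirror-symmetric decoder (assumed to stabilize the codebook).
# Stage II: non-mirror decoder, reusing the learned encoder and codebook
#           (assumed to maximize reconstruction fidelity).

STAGE1_EPOCHS = 100  # assumed stage boundary; the paper's schedule may differ

def training_plan(epoch: int) -> dict:
    """Return which decoder architecture is active and which modules train."""
    if epoch < STAGE1_EPOCHS:
        return {
            "decoder": "mirror",
            "trainable": {"encoder", "codebook", "mirror_decoder"},
        }
    # Stage II: swap in the non-mirror decoder; keep the learned codebook.
    return {
        "decoder": "non_mirror",
        "trainable": {"encoder", "codebook", "non_mirror_decoder"},
    }

print(training_plan(0)["decoder"])    # → mirror
print(training_plan(150)["decoder"])  # → non_mirror
```

In an actual training loop, the returned plan would select the decoder module and set gradient flags accordingly before each epoch; the key point is that the codebook persists across the architectural switch.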