🤖 AI Summary
This work addresses the limited reconstruction fidelity and noise robustness of neural speech codecs used in text-to-speech (TTS) systems. We propose DS-Codec, a TTS-oriented neural speech codec trained in two stages. Its core innovation is a training mechanism that switches between mirror and non-mirror architectures: Stage I employs a mirror-symmetric encoder-decoder to enhance codebook stability and noise robustness; Stage II switches to a non-mirror architecture to maximize reconstruction accuracy. This structural switching, combined with joint codebook optimization and ablation-driven design, enables DS-Codec to surpass state-of-the-art baselines on key metrics, including Mel Cepstral Distortion (MCD) and Short-Time Objective Intelligibility (STOI), under both clean and noisy conditions. Experiments demonstrate significantly improved speech tokenization quality, achieving high-fidelity, low-distortion waveform reconstruction. DS-Codec thus establishes a strong acoustic representation foundation for modern TTS systems.
📝 Abstract
Neural speech codecs are essential for advancing text-to-speech (TTS) systems. With the recent success of large language models in text generation, developing high-quality speech tokenizers has become increasingly important. This paper introduces DS-Codec, a novel neural speech codec featuring a dual-stage training framework that switches between mirror and non-mirror architectures, designed to achieve superior speech reconstruction. We conduct extensive experiments and ablation studies to evaluate the effectiveness of our training strategy and to compare the performance of the two architectures. Our results show that the mirrored structure significantly enhances the robustness of the learned codebooks, and that the training strategy balances the advantages of the mirrored and non-mirrored structures, leading to improved high-fidelity speech reconstruction.
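The dual-stage switching described above might be orchestrated along the following lines. This is a minimal, framework-agnostic sketch under stated assumptions: the stage boundary, module names, and the policy of carrying the encoder and codebook into Stage II are illustrative guesses, not the paper's exact recipe.

```python
# Hypothetical schedule for two-stage codec training.
# Stage I:  mirror-symmetric decoder (assumed to stabilize the codebook).
# Stage II: non-mirror decoder, reusing the learned encoder and codebook
#           (assumed to maximize reconstruction fidelity).

STAGE1_EPOCHS = 100  # assumed stage boundary; the paper's schedule may differ

def training_plan(epoch: int) -> dict:
    """Return which decoder architecture is active and which modules train."""
    if epoch < STAGE1_EPOCHS:
        return {
            "decoder": "mirror",
            "trainable": {"encoder", "codebook", "mirror_decoder"},
        }
    # Stage II: swap in the non-mirror decoder; keep the learned codebook.
    return {
        "decoder": "non_mirror",
        "trainable": {"encoder", "codebook", "non_mirror_decoder"},
    }

print(training_plan(0)["decoder"])    # → mirror
print(training_plan(150)["decoder"])  # → non_mirror
```

In an actual training loop, the returned plan would select the decoder module and set gradient flags accordingly before each epoch; the key point is that the codebook persists across the architectural switch.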