HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling

📅 2025-07-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Discrete speech tokenization is critical for speech coding but suffers from high computational overhead and deployment complexity due to parallel multi-quantizer architectures and high-dimensional encoders. This paper proposes HH-Codec, a single-quantizer neural codec tailored for spoken language modeling, achieving both high compression and high fidelity. Its core innovations include a speech-optimized vector quantization space, an asymmetric Audio-VQ-Mel-Audio coding-decoding architecture, and a dual-supervision (waveform + mel-spectrogram) optimization strategy with progressive training. HH-Codec achieves high-quality reconstruction of 24 kHz audio at an ultra-low bitrate of 0.3 kbps (24 tokens/s), significantly improving codebook utilization and reconstruction stability. Experiments demonstrate state-of-the-art performance on speech reconstruction and strong adaptability to generative models, while ablation studies confirm the necessity of each component. The code is publicly available.
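The reported numbers fit together via simple arithmetic: a single-codebook stream carries log2(codebook size) bits per token. The sketch below checks this for a hypothetical codebook of 8192 entries (the summary does not state the paper's actual codebook size); 24 tokens/s then yields roughly the quoted 0.3 kbps.

```python
import math

def bitrate_bps(tokens_per_sec: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook token stream: each token
    carries log2(codebook_size) bits."""
    return tokens_per_sec * math.log2(codebook_size)

# Hypothetical codebook size of 8192, chosen only to illustrate
# how tokens/s and bitrate relate; not taken from the paper.
bps = bitrate_bps(24, 8192)
print(bps)  # 312.0 bits/s, i.e. ~0.3 kbps
```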

📝 Abstract
Discrete speech tokenization is a fundamental component in speech codecs. However, in large-scale speech-to-speech systems, the complexity of parallel streams from multiple quantizers and the computational cost of high-time-dimensional codecs pose significant challenges. In this paper, we introduce HH-Codec, a neural codec that achieves extreme compression at 24 tokens per second for 24 kHz audio while relying on single-quantizer inference. Our approach involves a carefully designed Vector Quantization space for Spoken Language Modeling, optimizing compression efficiency while minimizing information loss. Building on this, we propose an asymmetric encoder-decoder architecture (Audio-VQ-Mel-Audio) that leverages dual supervision and progressive training to enhance reconstruction stability and fidelity. HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps. We further evaluate its effectiveness in codebook utilization and generative model adaptation, with extensive ablations validating the necessity of each module. HH-Codec is available at https://github.com/opendilab/HH-Codec.
Problem

Research questions and friction points this paper is trying to address.

Parallel streams from multiple quantizers complicate large-scale speech-to-speech systems
High-time-dimensional codecs incur heavy computational cost
Extreme compression must be achieved without sacrificing reconstruction fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-quantizer inference enabling extreme compression (24 tokens/s, 0.3 kbps)
Speech-optimized Vector Quantization space balancing compression efficiency and information loss
Asymmetric Audio-VQ-Mel-Audio encoder-decoder with dual waveform + mel-spectrogram supervision and progressive training
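The single-quantizer idea above boils down to a nearest-neighbour lookup: each encoder frame is replaced by the closest entry in one learned codebook, producing exactly one discrete token per frame. Below is a minimal toy sketch of that step; the `quantize` function name, codebook size, and dimensions are illustrative and not taken from the paper.

```python
import numpy as np

def quantize(z, codebook):
    """Map each encoder frame z[t] to its nearest codebook entry
    (squared-Euclidean nearest neighbour), returning one discrete
    token index per frame plus the quantized vectors."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K) distances
    idx = d.argmin(axis=1)                                     # (T,) token ids
    return idx, codebook[idx]

# Toy codebook: K=16 codes of dimension D=4 (real codebooks are learned).
codebook = np.arange(16 * 4, dtype=float).reshape(16, 4)
# Three "encoder frames" lying near codes 3, 7, 7, lightly perturbed.
z = codebook[[3, 7, 7]] + 0.1
idx, zq = quantize(z, codebook)
print(idx)  # [3 7 7]
```

A downstream spoken language model then consumes `idx` as an ordinary token sequence, which is why a single codebook (one stream) is so much simpler to model than parallel multi-quantizer streams.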
R
Rongkun Xue
The School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an, China
Y
Yazhe Niu
Shanghai AI Laboratory, Shanghai, China
S
Shuai Hu
Siberian Branch of the Russian Academy of Sciences
Z
Zixin Yin
SenseTime Research, Shanghai, China
Y
Yongqiang Yao
Shanghai Jiao Tong University, Shanghai, China
J
Jing Yang
The School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an, China