AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing neural audio codecs typically employ domain-specific codebooks for speech, music, or environmental sounds, hindering unified, efficient representation of general audio at ultra-low bitrates.
Method: We propose the first single-codebook neural codec for general audio, including speech, music, and environmental sounds, operating at ~700 bps while reconstructing 16 kHz audio. To reconcile divergent modeling requirements across domains, we introduce a Matryoshka codebook architecture with nested domain-specific partitions, trained via single-stage teacher distillation. We employ a Conformer encoder with STFT-based features and hierarchical knowledge distillation to jointly model heterogeneous audio types within one quantized latent space.
Contribution/Results: Experiments demonstrate that our method achieves speech and general audio reconstruction quality on par with state-of-the-art domain-specific single-layer quantizers. This work is the first to empirically validate the feasibility of high-fidelity, single-codebook vector quantization for general audio under ultra-low-bitrate constraints.

📝 Abstract
We propose AUV, a unified neural audio codec with a single codebook, which achieves favourable reconstruction of speech and further extends to general audio, including vocal, music, and sound. AUV can tackle any 16 kHz mixed-domain audio segment at bit rates around 700 bps. To accomplish this, we guide a Matryoshka codebook with nested domain-specific partitions, each assigned a corresponding teacher model for distillation, all in a single-stage training. A Conformer-style encoder-decoder architecture with STFT features as the audio representation is employed, yielding better audio quality. Comprehensive evaluations demonstrate that AUV matches the audio reconstruction ability of state-of-the-art domain-specific single-layer quantizer codecs, showcasing the potential of audio universal vector quantization with a single codebook. The pre-trained model and demo samples are available at https://swivid.github.io/AUV/.
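To see how a single codebook can reach roughly 700 bps, a quick back-of-the-envelope check helps. The codebook size and frame rate below are illustrative assumptions, not values stated in this excerpt: a 16,384-entry codebook emitting one index per frame at 50 frames per second yields exactly 700 bps.

```python
import math

# Hypothetical single-codebook bitrate check.
# Both values below are assumptions for illustration, not from the paper.
codebook_size = 16384        # hypothetical number of codebook entries
frame_rate_hz = 50           # hypothetical quantized frames per second

bits_per_frame = math.log2(codebook_size)       # 14 bits to index one code
bitrate_bps = frame_rate_hz * bits_per_frame    # 50 * 14 = 700 bps

print(bitrate_bps)  # 700.0
```

Other (codebook size, frame rate) pairs hit the same budget, e.g. 4,096 codes at ~58 Hz; the point is only that a single index stream at ultra-low bitrate leaves very few bits per frame.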
Problem

Research questions and friction points this paper is trying to address.

Teaching universal audio coding with a single codebook
Reconstructing speech and general audio efficiently in one model
Achieving high reconstruction quality at bit rates around 700 bps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single codebook for universal audio compression
Nested domain partitions with teacher distillation
Conformer encoder-decoder with STFT audio representation
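The nested-partition idea above can be sketched roughly as follows. This is a minimal illustrative assumption of how a Matryoshka-style codebook might work, not the paper's actual implementation: one shared codebook whose first k entries form the partition available to a given domain, so domain-specific partitions are nested prefixes of the same table. The partition sizes, domain names, and nearest-neighbour rule are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared codebook; domain partitions are nested prefixes of it.
# Sizes below are illustrative, not taken from the paper.
codebook = rng.standard_normal((1024, 64))
partition_sizes = {"speech": 256, "music": 512, "general": 1024}

def quantize(x, domain):
    """Return the index of the nearest code within the domain's nested prefix."""
    k = partition_sizes[domain]
    dists = np.linalg.norm(codebook[:k] - x, axis=1)  # distance to each code
    return int(np.argmin(dists))

x = rng.standard_normal(64)
idx_speech = quantize(x, "speech")     # always an index below 256
idx_general = quantize(x, "general")   # searches the full codebook
```

Because each partition is a subset of the next, the full "general" search can never do worse than a smaller partition on the same vector, which is the nesting invariant that lets one codebook serve several domains.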
👥 Authors
Yushen Chen (Shanghai Jiao Tong University)
Kai Hu (Tencent Hunyuan, China)
Long Zhou (Tencent Hunyuan, China)
Shulin Feng (Tencent Hunyuan, China)
Xusheng Yang (Peking University, Shenzhen, China)
Hangting Chen (Tencent Hunyuan)
Xie Chen (X-LANCE Lab, Shanghai Jiao Tong University, China)