AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing neural audio codecs typically employ domain-specific codebooks for speech, music, or environmental sounds, hindering unified, efficient representation of general audio at ultra-low bitrates.
Method: We propose the first single-codebook neural codec for general audio, including speech, music, and environmental sounds, operating at ~700 bps while reconstructing 16 kHz audio. To reconcile divergent modeling requirements across domains, we introduce a Matryoshka codebook architecture with nested domain-specific partitions, trained via single-stage teacher distillation. We employ a Conformer encoder with STFT-based features and hierarchical knowledge distillation to jointly model heterogeneous audio types within one quantized latent space.
Contribution/Results: Experiments demonstrate that our method achieves speech and general audio reconstruction quality on par with state-of-the-art domain-specific single-layer quantizers. This work is the first to empirically validate the feasibility of high-fidelity, single-codebook vector quantization for general audio under ultra-low-bitrate constraints.

📝 Abstract
We propose AUV, a unified neural audio codec with a single codebook, which achieves favourable reconstruction of speech and further extends to general audio, including vocal, music, and sound. AUV can tackle any 16 kHz mixed-domain audio segment at bit rates around 700 bps. To accomplish this, we guide a Matryoshka codebook with nested domain-specific partitions, each assigned a corresponding teacher model for distillation, all in a single-stage training. A Conformer-style encoder-decoder architecture with STFT features as the audio representation is employed, yielding better audio quality. Comprehensive evaluations demonstrate that AUV matches the audio reconstruction ability of state-of-the-art domain-specific single-layer quantizer codecs, showcasing the potential of audio universal vector quantization with a single codebook. The pre-trained model and demo samples are available at https://swivid.github.io/AUV/.
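To see how a single codebook can reach roughly 700 bps, a quick back-of-the-envelope check helps. The codebook size and frame rate below are illustrative assumptions, not values stated in this excerpt: a 16,384-entry codebook emitting one index per frame at 50 frames per second yields exactly 700 bps.

```python
import math

# Hypothetical single-codebook bitrate check.
# Both values below are assumptions for illustration, not from the paper.
codebook_size = 16384        # hypothetical number of codebook entries
frame_rate_hz = 50           # hypothetical quantized frames per second

bits_per_frame = math.log2(codebook_size)       # 14 bits to index one code
bitrate_bps = frame_rate_hz * bits_per_frame    # 50 * 14 = 700 bps

print(bitrate_bps)  # 700.0
```

Other (codebook size, frame rate) pairs hit the same budget, e.g. 4,096 codes at ~58 Hz; the point is only that a single index stream at ultra-low bitrate leaves very few bits per frame.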
Problem

Research questions and friction points this paper is trying to address.

Teaching universal audio coding with a single codebook
Reconstructing speech and general audio efficiently in one model
Achieving high reconstruction quality at bit rates around 700 bps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single codebook for universal audio compression
Nested domain partitions with teacher distillation
Conformer encoder-decoder with STFT audio representation
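The nested-partition idea above can be sketched roughly as follows. This is a minimal illustrative assumption of how a Matryoshka-style codebook might work, not the paper's actual implementation: one shared codebook whose first k entries form the partition available to a given domain, so domain-specific partitions are nested prefixes of the same table. The partition sizes, domain names, and nearest-neighbour rule are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared codebook; domain partitions are nested prefixes of it.
# Sizes below are illustrative, not taken from the paper.
codebook = rng.standard_normal((1024, 64))
partition_sizes = {"speech": 256, "music": 512, "general": 1024}

def quantize(x, domain):
    """Return the index of the nearest code within the domain's nested prefix."""
    k = partition_sizes[domain]
    dists = np.linalg.norm(codebook[:k] - x, axis=1)  # distance to each code
    return int(np.argmin(dists))

x = rng.standard_normal(64)
idx_speech = quantize(x, "speech")     # always an index below 256
idx_general = quantize(x, "general")   # searches the full codebook
```

Because each partition is a subset of the next, the full "general" search can never do worse than a smaller partition on the same vector, which is the nesting invariant that lets one codebook serve several domains.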
👥 Authors
Yushen Chen (Shanghai Jiao Tong University)
Kai Hu (Tencent Hunyuan, China)
Long Zhou (Tencent Hunyuan, China)
Shulin Feng (Tencent Hunyuan, China)
Xusheng Yang (Peking University, Shenzhen, China)
Hangting Chen (Tencent Hunyuan)
Xie Chen (X-LANCE Lab, Shanghai Jiao Tong University, China)