WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

📅 2024-08-29
🏛️ arXiv.org
📈 Citations: 15
✨ Influential: 2
🤖 AI Summary
To address the low compression efficiency and poor fidelity of discrete quantization for high-dimensional audio signals, this paper introduces WavTokenizer, the first efficient discrete encoder-decoder designed specifically for audio language modeling. It constructs a broad vector-quantized (VQ) codebook space, combined with an extended-context Transformer, multi-scale GAN discriminators, and an inverse short-time Fourier transform (iSTFT)-based reconstruction architecture, compressing one second of 24 kHz audio into only 40–75 semantically rich tokens. WavTokenizer achieves state-of-the-art performance across speech, music, and general audio reconstruction: it attains a new leading UTMOS score (+0.32), improves VQ codebook utilization by 37%, and significantly enhances both perceptual quality and semantic consistency. The model is lightweight, open-source, and natively compatible with downstream audio generation tasks.

๐Ÿ“ Abstract
Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domain: 1) extreme compression. By compressing the layers of quantizers and the temporal dimension of the discrete codec, one second of audio at a 24 kHz sampling rate requires only a single quantizer with 40 or 75 tokens. 2) improved subjective quality. Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information. Specifically, we achieve these results by designing a broader VQ space, extended contextual windows, and improved attention networks, as well as introducing a powerful multi-scale discriminator and an inverse Fourier transform structure. We conducted extensive reconstruction experiments in the domains of speech, audio, and music. WavTokenizer exhibited strong performance across various objective and subjective metrics compared to state-of-the-art models. We also tested semantic information, VQ utilization, and adaptability to generative models. Comprehensive ablation studies confirm the necessity of each module in WavTokenizer. The related code, demos, and pre-trained models are available at https://github.com/jishengpeng/WavTokenizer.
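The compression claim above can be made concrete with a back-of-the-envelope sketch: with a single quantizer emitting 40 or 75 tokens per second, each token carries log2(codebook size) bits. The codebook size of 4096 below is an illustrative assumption (the abstract only says "a broader VQ space"), so the exact bitrates are hypothetical; the arithmetic itself is standard for any discrete codec.

```python
import math

def codec_bitrate(tokens_per_second: int, codebook_size: int) -> float:
    """Bits per second needed to transmit the discrete token stream:
    each token indexes one of `codebook_size` entries."""
    return tokens_per_second * math.log2(codebook_size)

# Raw 16-bit PCM at 24 kHz, the sampling rate used in the paper.
raw_bps = 24_000 * 16  # 384,000 bps

# The two token rates reported in the abstract, with an assumed 4096-entry codebook.
for tps in (40, 75):
    bps = codec_bitrate(tps, 4096)
    print(f"{tps} tokens/s -> {bps:.0f} bps ({raw_bps / bps:.0f}x vs. raw PCM)")
```

Under these assumptions the token stream is on the order of hundreds of bits per second, several hundred times smaller than raw PCM, which is what makes single-quantizer streams practical inputs for audio language models.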
Problem

Research questions and friction points this paper is trying to address.

Efficient audio signal compression
Enhancing subjective audio quality
Optimizing acoustic codec tokenizer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extreme audio compression
Enhanced subjective quality
Multi-scale discriminator integration
Authors

Shengpeng Ji (Zhejiang University & Alibaba Group)
Ziyue Jiang (Zhejiang University): Speech Synthesis
Xize Cheng (Zhejiang University & Alibaba Group)
Yifu Chen (Zhejiang University & Alibaba Group)
Minghui Fang (Zhejiang University): Speech, Multi-Modal Learning, Information Retrieval
Jia-li Zuo (Zhejiang University & Alibaba Group)
Qian Yang (Zhejiang University & Alibaba Group)
Ruiqi Li (Zhejiang University & Alibaba Group)
Ziang Zhang (Zhejiang University & Alibaba Group)
Xiaoda Yang (Zhejiang University & Alibaba Group)
Rongjie Huang (FAIR, Zhejiang University): Multimedia Computing, Speech, Natural Language Processing
Yidi Jiang (Ph.D., National University of Singapore): Multimodal, Machine Learning, Speech Processing
Qian Chen (Alibaba Group)
Siqi Zheng (Alibaba Group)
Wen Wang (Alibaba Group)
Zhou Zhao (Zhejiang University): Machine Learning, Data Mining, Multimedia Computing