AI Summary
To address the low compression efficiency and poor fidelity of discrete quantization for high-dimensional audio signals, this paper introduces WavTokenizer, the first efficient discrete encoder-decoder specifically designed for audio language modeling. It innovatively constructs a wide vector-quantized (VQ) codebook space, integrated with an extended-context Transformer, multi-scale GAN discriminators, and an inverse short-time Fourier transform (iSTFT)-based reconstruction architecture, enabling compression of one second of 24 kHz audio into only 40-75 semantically rich tokens. WavTokenizer achieves state-of-the-art performance across speech, music, and general audio reconstruction: it attains a new industry-leading UTMOS score (+0.32), improves VQ codebook utilization by 37%, and significantly enhances both perceptual quality and semantic consistency. The model is lightweight, open-source, and natively compatible with downstream audio generation tasks.
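To get a feel for the compression these numbers imply, here is a back-of-the-envelope sketch. The 40 and 75 tokens-per-second rates come from the summary above; the 12-bit (4096-entry) codebook size is an assumption about the "wide VQ codebook" and not stated in this text:

```python
# Back-of-the-envelope compression arithmetic for a single-quantizer codec.
# Assumption: a 4096-entry (12-bit) codebook; the 40 and 75 tokens-per-second
# rates are from the summary above.

SAMPLE_RATE = 24_000   # input samples per second
CODEBOOK_BITS = 12     # assumed: log2(4096)

for tokens_per_sec in (40, 75):
    ratio = SAMPLE_RATE / tokens_per_sec      # audio samples represented per token
    bitrate = tokens_per_sec * CODEBOOK_BITS  # bits per second of the token stream
    print(f"{tokens_per_sec} tok/s -> {ratio:.0f}x temporal compression, "
          f"{bitrate / 1000:.2f} kbps")
# 40 tok/s -> 600x temporal compression, 0.48 kbps
# 75 tok/s -> 320x temporal compression, 0.90 kbps
```

Even at the higher 75-token rate, the token stream sits under 1 kbps, which is what makes single-quantizer streams attractive for audio language models.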
Abstract
Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domain: 1) extreme compression. By compressing the layers of quantizers and the temporal dimension of the discrete codec, one second of audio at a 24 kHz sampling rate requires only a single quantizer with 40 or 75 tokens. 2) improved subjective quality. Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information. Specifically, we achieve these results by designing a broader VQ space, extended contextual windows, and improved attention networks, as well as by introducing a powerful multi-scale discriminator and an inverse Fourier transform structure. We conducted extensive reconstruction experiments in the domains of speech, audio, and music. WavTokenizer exhibited strong performance across various objective and subjective metrics compared to state-of-the-art models. We also tested semantic information, VQ utilization, and adaptability to generative models. Comprehensive ablation studies confirm the necessity of each module in WavTokenizer. The related code, demos, and pre-trained models are available at https://github.com/jishengpeng/WavTokenizer.
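The abstract mentions measuring VQ codebook utilization. As a hedged illustration (not the paper's evaluation code), utilization is commonly computed as the fraction of codebook entries that appear at least once in the emitted token stream; the codebook size and token stream below are toy values chosen for the example:

```python
# Toy illustration of VQ codebook utilization: the fraction of codebook
# entries actually emitted at least once over a token stream.
# The codebook size and token values below are made-up examples.

def codebook_utilization(tokens, codebook_size):
    """Return the fraction of codebook entries used by `tokens`."""
    used = set(tokens)  # distinct codebook indices that appeared
    return len(used) / codebook_size

# Hypothetical token stream drawn from an 8-entry codebook.
tokens = [0, 1, 1, 3, 3, 3, 5, 7]
print(codebook_utilization(tokens, codebook_size=8))  # -> 0.625 (5 of 8 entries used)
```

A larger codebook only helps reconstruction if the encoder actually spreads its tokens across the entries, which is why utilization is reported alongside quality metrics.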