🤖 AI Summary
This work addresses a mismatch in audio representation learning: continuous audio autoencoders reconstruct waveforms well but yield latents with weak structure for understanding, while self-supervised encoders capture semantics but are not directly decodable. The authors propose a unified audio tokenizer built from two components. The first is a noise-regularized autoencoder bottleneck that uses channel normalization and stochastic perturbation, rather than KL-based variational training, to produce scale-controlled continuous latents suited to reconstruction and autoregressive generation. The second is a latent-side representation encoder, trained on the frozen autoencoder latents with RQ-MTP (Residual Quantization with Masked Token Prediction) and supervision from a frozen large language model. The resulting tokenizer is designed to provide high-dimensional representations for understanding while keeping the normalized continuous latents as generation targets, so that one model serves both comprehension and generation.
📝 Abstract
Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates the design of a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.
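To make the bottleneck idea concrete, here is a minimal NumPy sketch of one way channel normalization and stochastic perturbation could be combined. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the normalization axis, the noise distribution, or its scale, so the function name `noise_regularized_bottleneck`, the per-frame normalization across channels, the Gaussian noise, and the `noise_std` value are all hypothetical choices.

```python
import numpy as np

def noise_regularized_bottleneck(z, noise_std=0.1, training=True, rng=None, eps=1e-5):
    """Hypothetical bottleneck: normalize, then perturb during training.

    z: array of shape (channels, frames) holding continuous latents.
    Assumption: each frame is normalized across its channels to zero mean
    and unit variance; the paper may use a different normalization.
    """
    # Channel normalization: fixes the latent scale without a KL term.
    mean = z.mean(axis=0, keepdims=True)
    var = z.var(axis=0, keepdims=True)
    z_norm = (z - mean) / np.sqrt(var + eps)

    # Stochastic perturbation (assumed Gaussian), applied only in training
    # so the decoder learns to tolerate small deviations in the latents.
    if training:
        rng = np.random.default_rng() if rng is None else rng
        z_norm = z_norm + noise_std * rng.standard_normal(z_norm.shape)
    return z_norm

# Usage: 8 latent channels, 50 frames, with an arbitrary offset and scale.
rng = np.random.default_rng(0)
latents = 5.0 * rng.standard_normal((8, 50)) + 3.0
clean = noise_regularized_bottleneck(latents, training=False)
noisy = noise_regularized_bottleneck(latents, training=True, rng=rng)
```

In this sketch the normalization alone bounds the latent scale, which is the role a KL penalty would otherwise play, while the added noise acts as the regularizer; at inference time the clean normalized latents would serve as the generation targets.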