F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a mismatch in existing audio autoencoders: latents that reconstruct waveforms well tend to carry weak structure for semantic understanding, which makes it hard for a single tokenizer to serve both generation and comprehension. The authors propose a unified audio tokenizer built from two components. A noise-regularized bottleneck applies channel normalization and stochastic perturbation in place of KL-based variational training, producing scale-controlled continuous latents suited to reconstruction and autoregressive generation. A latent-side representation encoder is then trained on the frozen autoencoder latents with RQ-MTP (Residual Quantization with Masked Token Prediction) and semantic supervision from a frozen large language model. The resulting tokenizer exposes high-dimensional representations for understanding while keeping the normalized continuous latents as generation targets.
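To make the "RQ" half of RQ-MTP concrete, here is a minimal NumPy sketch of generic residual quantization, taking the summary's expansion of the acronym at face value. It is not the authors' implementation: the function name `residual_quantize`, the codebook contents, and the number of stages are all assumptions chosen for illustration, and the masked token prediction and frozen-LLM supervision are not shown.

```python
import numpy as np


def residual_quantize(x, codebooks):
    """Quantize vectors stage by stage, each stage coding the previous residual.

    x:         float array of shape (n, d)
    codebooks: list of float arrays, each of shape (k, d) (hypothetical values)
    Returns (codes, recon): integer codes of shape (n, stages) and the
    reconstruction of shape (n, d) obtained by summing the chosen codewords.
    """
    residual = np.array(x, dtype=float)  # working copy, leaves the input untouched
    recon = np.zeros_like(residual)
    codes = []
    for codebook in codebooks:
        # Squared Euclidean distance from every residual to every codeword: (n, k)
        diff = residual[:, None, :] - codebook[None, :, :]
        dist = (diff ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)   # nearest codeword per vector
        chosen = codebook[idx]      # (n, d)
        recon += chosen             # accumulate the approximation
        residual -= chosen          # what is left for the next stage
        codes.append(idx)
    return np.stack(codes, axis=1), recon
```

Each stage only has to describe what the earlier stages missed, so a stack of small codebooks can approximate a latent more finely than a single codebook of the same size.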

📝 Abstract

Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.
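The bottleneck described in the abstract can be sketched roughly as follows. This is an illustrative NumPy version under stated assumptions, not the paper's code: the abstract does not say which axis "channel normalization" uses or how the perturbation is scaled, so the per-frame normalization across channels, the additive Gaussian noise, and the names `noise_regularized_bottleneck`, `noise_std`, and `eps` are all hypothetical.

```python
import numpy as np


def noise_regularized_bottleneck(z, noise_std=0.1, eps=1e-5, rng=None, training=True):
    """Normalize latents across channels, then perturb them during training.

    z: float array of shape (batch, channels, frames).
    Assumption: each frame is normalized to zero mean and unit variance
    over its channel axis, and the perturbation is additive Gaussian noise.
    """
    mean = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    z_norm = (z - mean) / np.sqrt(var + eps)  # scale-controlled latents

    if not training:
        return z_norm  # clean latents at inference time

    rng = np.random.default_rng() if rng is None else rng
    # The noise stands in for a KL term: no prior-matching loss is computed here.
    return z_norm + noise_std * rng.standard_normal(z_norm.shape)
```

In this sketch the normalization fixes the scale of the latents directly, so no KL weight has to be tuned to keep them bounded, and the noise encourages a decoder to tolerate small deviations such as those an autoregressive generator would introduce.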

Problem

Research questions and friction points this paper is trying to address.

audio tokenizer

audio autoencoder

latent representation

understanding and generation

self-supervised learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

audio tokenizer

noise-regularized bottleneck

latent representation