EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing audio tokenizers struggle to deliver both high-fidelity reconstruction and rich semantic understanding, and usually trade one for the other. This work proposes EntangleCodec, a unified discrete audio tokenizer that learns joint semantic-acoustic representations by aligning audio with rich captions rather than ASR transcripts, so that both kinds of information are integrated before quantization. A flow-matching diffusion decoder provides high-quality audio reconstruction. Semantic and acoustic information are encoded in a single compact token stream, which avoids the redundancy and misalignment of separate semantic and acoustic streams and supports unified modeling of speech, music, and general audio. In experiments, the tokenizer outperforms all codec-based baselines on audio understanding by up to +7.4% on MMAR, with reconstruction quality competitive with specialized codecs. A 0.6B-parameter language model built on this tokenizer surpasses specialized continuous-representation LLMs with over 13B parameters across three benchmarks while using 22× fewer parameters, and the 8B variant sets new state-of-the-art results on MMAR.

📝 Abstract

Audio tokenizers serve as the discrete interface between continuous audio and Audio Language Models (ALMs), but existing tokenizers often struggle to support both understanding and generation. Reconstruction-oriented codecs preserve acoustic fidelity but lack rich semantics, while semantic-aware tokenizers typically rely on separate semantic and acoustic streams, introducing redundancy or misalignment. We propose EntangleCodec, a unified discrete audio tokenizer that learns caption-aligned semantic-acoustic representations before quantization. By aligning audio with rich captions rather than ASR transcripts, EntangleCodec captures linguistic content, speaker identity, emotion, prosody, and acoustic scenes within a compact token stream. A flow-matching diffusion decoder further enables high-quality reconstruction across speech, music, and general audio. EntangleCodec achieves reconstruction quality competitive with specialized codecs, outperforms all codec-based baselines on audio understanding by up to +7.4% on MMAR, and supports both TTS and TTA generation in a unified framework. Furthermore, EntangleCodec-based audio language models demonstrate strong scaling behavior: even at 0.6B parameters, the model surpasses specialized continuous-representation LLMs with over 13B parameters across three benchmarks using 22× fewer parameters; scaling to 8B further establishes new state-of-the-art results on MMAR, highlighting that representation quality is as critical as model scale in audio language modeling. Code and model weights are available at https://github.com/luckyerr/EntangleCodec.
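
The "align with captions, then quantize into one token stream" idea from the abstract can be illustrated with a small NumPy sketch. This is not the EntangleCodec implementation: the abstract does not state the alignment objective or the quantizer, so the symmetric contrastive (InfoNCE-style) loss, the nearest-neighbour codebook lookup, and all function names, shapes, and the temperature value below are assumptions chosen only for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def caption_alignment_loss(audio_emb, caption_emb, temperature=0.07):
    """Symmetric contrastive loss over paired (audio, caption) embeddings.

    ASSUMPTION: a CLIP-style InfoNCE objective stands in for whatever
    caption-alignment objective the paper actually uses.
    audio_emb, caption_emb: arrays of shape (batch, dim); row i of each is a pair.
    """
    a = l2_normalize(audio_emb)
    c = l2_normalize(caption_emb)
    logits = a @ c.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(a))                 # matching pairs lie on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)              # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[idx, idx].mean()

    # Average the audio-to-caption and caption-to-audio directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


def quantize(frames, codebook):
    """Map each frame to its nearest codebook entry (one token per frame).

    ASSUMPTION: a single plain vector-quantization codebook; the paper's
    quantizer may differ.
    frames: (T, D) aligned features, codebook: (K, D).
    Returns (token ids of shape (T,), quantized frames of shape (T, D)).
    """
    sq_dist = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = sq_dist.argmin(axis=1)
    return ids, codebook[ids]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(8, 4))
    frames = codebook[[3, 1, 3, 7]] + 0.01 * rng.normal(size=(4, 4))
    token_ids, _ = quantize(frames, codebook)
    print("single token stream:", token_ids)
```

In this toy setup the alignment loss would shape the encoder features before `quantize` is applied, so the resulting single stream of ids carries whatever the captions describe as well as the acoustic detail needed by the decoder.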

Problem

Research questions and friction points this paper is trying to address.

audio tokenizer

semantic-acoustic entanglement

audio understanding

audio generation

discrete representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic-acoustic entanglement

unified audio tokenizer

caption-aligned representation