UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

2026-05-29
Citations: 0
Influential: 0

AI Summary
This work addresses the limitations of existing semantic speech tokenizers, which overly emphasize linguistic abstraction and consequently lack robust perception of general audio, hindering their applicability to non-speech-centric tasks. To overcome this, the authors propose a unified single-codebook semantic tokenizer that introduces Semantic-Acoustic Primitives (SAPs) as structured supervisory signals. The architecture incorporates a content-aware gating mechanism and a shallow acoustic detail recovery module to jointly model linguistic and acoustic information. This approach preserves high-fidelity speech generation while significantly enhancing the model's capacity to understand and represent general audio. Empirical results demonstrate consistent outperformance over current single-codebook baselines across diverse audio understanding and generation benchmarks, and the tokenizer functions effectively as a unified audio interface for integration with large language models.
Abstract
Semantic speech tokenizers have become a widely used interface for Audio-LLMs, owing to their compact single-codebook design and strong linguistic alignment. However, their focus on linguistic abstraction induces acoustic blindness, limiting their applicability beyond speech-centric tasks. We propose UniAudio-Token, a framework that empowers semantic tokenizers with general audio perception without compromising speech ability. Instead of altering the semantic paradigm, UniAudio-Token mitigates its information loss through two key innovations: (1) Semantic-Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory-scene primitives; and (2) Semantic-Acoustic Equilibrium (SAE) introduces a content-aware gating mechanism that adaptively restores fine-grained acoustic details from shallow layers. Extensive evaluations show that UniAudio-Token learns comprehensive universal representations while preserving high-fidelity speech generation. When integrated with downstream LLMs, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks, effectively serving as a unified audio interface. We publicly release all our code, including training and inference scripts, together with the model checkpoints at https://github.com/Tencent/Universal_Audio_Tokenizer.
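The content-aware gating idea mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: the function name `gated_fusion`, the linear gate, and the additive fusion rule are all assumptions made for illustration. It computes one scalar gate per frame from the deep (semantic) features and uses it to decide how much of the shallow (acoustic) features to add back.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a real score to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    deep: Sequence[Sequence[float]],
    shallow: Sequence[Sequence[float]],
    gate_weights: Sequence[float],
    gate_bias: float = 0.0,
) -> List[List[float]]:
    """Hypothetical content-aware gate (illustrative assumption only).

    For each frame, a scalar gate is computed from the deep features,
    and the shallow features are added back scaled by that gate:
        fused = deep + gate * shallow
    """
    if len(deep) != len(shallow):
        raise ValueError("deep and shallow must have the same number of frames")

    fused: List[List[float]] = []
    for deep_frame, shallow_frame in zip(deep, shallow):
        # The gate depends on the content of the deep (semantic) frame.
        score = sum(w * x for w, x in zip(gate_weights, deep_frame)) + gate_bias
        gate = sigmoid(score)
        fused.append([d + gate * s for d, s in zip(deep_frame, shallow_frame)])
    return fused


# Usage: two frames with two feature dimensions each.
deep_features = [[1.0, 2.0], [0.5, -0.5]]
shallow_features = [[2.0, 4.0], [1.0, 1.0]]
fused_features = gated_fusion(deep_features, shallow_features, gate_weights=[0.0, 0.0])
```

With all-zero gate weights and zero bias the gate is 0.5 for every frame, so half of the shallow signal is restored; a strongly negative bias closes the gate and leaves the deep features essentially unchanged.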
Problem

Research questions and friction points this paper is trying to address.

semantic speech tokenizers
acoustic blindness
general audio perception
audio representation
speech-centric tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-Acoustic Primitives
Semantic-Acoustic Equilibrium
universal audio tokenizer
audio-LLM interface
single-codebook representation
Similar Papers
2024-07-22 · arXiv.org · Citations: 4
Yuhan Song
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Linhao Zhang
Basic Model Technology Center, WeChat AI, Tencent Inc.
Aiwei Liu
Tsinghua University
Natural Language Processing, Large Language Models, AI Safety, Watermarking
Chuhan Wu
WeChat AI, Tencent
Foundation Model, Pretraining, Post Training, LLM Agent
Sijun Zhang
Basic Model Technology Center, WeChat AI, Tencent Inc.
Wei Jia
Basic Model Technology Center, WeChat AI, Tencent Inc.
Yuan Liu
Basic Model Technology Center, WeChat AI, Tencent Inc.
Houfeng Wang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Xiao Zhou
M.Phil student in HKUST
Autonomous Driving, DRL