Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction

📅 2025-06-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing core challenges—including weak speech-text cross-modal alignment, low speech synthesis quality, and information density mismatch—this paper proposes the Speech-Language Modeling (SLM) framework for unified speech-language representation. Methodologically, it introduces fully decoupled speech and text tokenizers, adopts a multi-token prediction (MTP) paradigm to mitigate semantic density disparities across modalities, and constructs RoleTriviaQA—the first large-scale spoken question-answering benchmark incorporating speaker identity. The framework integrates a speaker-aware generation architecture with an LLM-centric training strategy. Experimental results demonstrate substantial improvements: word error rate drops significantly (6.07 → 3.01), speech synthesis quality and decoding speed improve concurrently (up to 12× acceleration), and on RoleTriviaQA, both knowledge comprehension accuracy and speaker consistency achieve marked gains.

📝 Abstract
Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the impact of key components (i.e., speech tokenizers, speech heads, and speaker modeling) on the performance of LLM-centric SLMs. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12× faster decoding and a substantial drop in word error rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.
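The MTP idea from the abstract, that each hidden state decodes several speech tokens so the autoregressive loop runs fewer times, can be illustrated with a minimal sketch. This is a toy model of the decoding loop only: the "heads" here are stand-in arithmetic, not the paper's learned projections, and all names are illustrative.

```python
# Toy sketch of multi-token prediction (MTP) decoding, assuming k lightweight
# heads attached to each LLM hidden state; real systems use learned linear heads.

def standard_decode(hidden_states, vocab_size=1024):
    """Baseline: one speech token per hidden state, so len(hidden_states)
    autoregressive steps are needed for len(hidden_states) tokens."""
    return [h % vocab_size for h in hidden_states]

def mtp_decode(hidden_states, k=4, vocab_size=1024):
    """MTP: each hidden state decodes k speech tokens via k separate heads,
    so N speech tokens need only ceil(N / k) LLM forward passes."""
    tokens = []
    for h in hidden_states:        # one LLM forward pass per iteration
        for head in range(k):      # k cheap head evaluations, no extra forward
            tokens.append((h + head) % vocab_size)
    return tokens

# 3 forward passes yield 12 speech tokens with k=4, i.e. a k-fold reduction
# in decoding steps, which is where the reported speedup comes from.
hidden = [3, 17, 42]
print(len(mtp_decode(hidden, k=4)))
```

The speedup is bounded by k (the number of tokens emitted per hidden state); the paper's reported figure of up to 12× reflects how much denser speech tokens are than text tokens.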
Problem

Research questions and friction points this paper is trying to address.

Improving cross-modal alignment in speech-language models
Enhancing speech generation quality with decoupled tokenizers
Addressing speech-text information density mismatch via multi-token prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled tokenizers improve speech-text alignment.
Multi-token prediction accelerates speech decoding.
Speaker-aware generation enhances role-playing consistency.
👥 Authors
Xiaoran Fan, Fudan University
Zhichao Sun, Fudan University
Yangfan Gao, Fudan University
Jingfei Xiong, Fudan University
Hang Yan, The Chinese University of Hong Kong
Yifei Cao, Fudan University
Jiajun Sun, Fudan University
Shuo Li, Fudan University
Zhihao Zhang, Fudan University
Zhiheng Xi, Fudan University
Yuhao Zhou, Fudan University
Senjie Jin, Fudan University
Changhao Jiang, Fudan University
Junjie Ye, Fudan University
Ming Zhang, Fudan University
Rui Zheng, Fudan University
Zhenhua Han
Yunke Zhang, Honor Device Co., Ltd
Demei Yan, Honor Device Co., Ltd
Shaokang Dong, Honor Device Co., Ltd
Tao Ji, Renmin University of China
Tao Gui, Fudan University
Qi Zhang, Fudan University
Xuanjing Huang, Fudan University