🤖 AI Summary
This paper targets three core challenges in speech-language modeling: weak speech-text cross-modal alignment, low speech synthesis quality, and the information density mismatch between the two modalities. It proposes a Speech-Language Modeling (SLM) framework for unified speech-language representation. Methodologically, it introduces fully decoupled speech and text tokenizers, adopts a multi-token prediction (MTP) paradigm to mitigate the semantic density disparity across modalities, and constructs RoleTriviaQA, a large-scale spoken question-answering benchmark that incorporates speaker identity. The framework couples a speaker-aware generation architecture with an LLM-centric training strategy. Experiments show substantial improvements: word error rate drops from 6.07 to 3.01, speech synthesis quality and decoding speed improve concurrently (up to 12× faster decoding), and on RoleTriviaQA both knowledge-comprehension accuracy and speaker consistency gain markedly.
📝 Abstract
Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the impact of key components (i.e., speech tokenizers, speech heads, and speaker modeling) on the performance of LLM-centric SLMs. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12$\times$ faster decoding and a substantial drop in word error rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.
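The core MTP idea described above can be sketched minimally: instead of emitting one speech token per decoder step, a single hidden state is projected to logits for k consecutive speech tokens, cutting the number of decode steps by roughly k. The names, the shared-projection design, and all dimensions below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper)
HIDDEN, VOCAB, K = 256, 1024, 4

# A single shared projection from one hidden state to K groups of
# speech-token logits (the paper's head may be structured differently).
W = rng.standard_normal((HIDDEN, K * VOCAB)) * 0.02

def mtp_decode_step(h):
    """Map one decoder hidden state (HIDDEN,) to K speech-token ids."""
    logits = (h @ W).reshape(K, VOCAB)  # one logit row per predicted position
    return logits.argmax(axis=-1)       # greedy pick: (K,) token ids

h = rng.standard_normal(HIDDEN)
tokens = mtp_decode_step(h)
print(tokens.shape)  # one step yields K tokens instead of 1
```

With K tokens recovered per forward pass, an utterance of N speech tokens needs about N/K decoder steps, which is the source of the reported up-to-12x decoding speedup.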