Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically investigates how speech segmentation width and the cluster size of discrete units affect Speech Language Model (SLM) performance. The authors propose a unified tokenization framework that combines fixed- or variable-width segmentation with K-means clustering at multiple vocabulary scales. The analysis reveals a synergistic benefit between moderately coarse segmentation and large cluster sizes, and shows that combining multiple tokens helps capture fine-grained spoken semantics. On zero-shot Spoken Language Understanding (SLU) benchmarks, the most efficient of the best-performing configurations reduces training data requirements by 50% and training runtime by 70%. The core contribution lies in establishing fundamental trade-offs in speech token design and empirically showing that high-capacity discrete representations benefit low-resource SLM training.

📝 Abstract
The purpose of speech tokenization is to transform a speech signal into a sequence of discrete representations, serving as the foundation for speech language models (SLMs). While there are many options for speech tokenization, their effect on the performance of SLMs remains unclear. This paper investigates two key aspects of speech tokenization: the segmentation width and the cluster size of discrete units. First, we segment speech signals into fixed- or variable-width segments and pool the representations within each segment. We then train K-means models with multiple cluster sizes. Through evaluation on zero-shot spoken language understanding benchmarks, we find a positive effect of moderately coarse segmentation and larger cluster sizes. Notably, among the best-performing models, the most efficient one achieves a 50% reduction in training data and a 70% decrease in training runtime. Our analysis highlights the importance of combining multiple tokens to enhance fine-grained spoken language understanding.
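The tokenization pipeline described in the abstract can be sketched as two steps: pool frame-level features over segments of a chosen width, then map each pooled vector to its nearest entry in a trained codebook. The sketch below is a minimal, hypothetical illustration with numpy; the function name, dimensions, and random codebook are assumptions, not the paper's actual implementation.

```python
import numpy as np

def tokenize(frames, width, codebook):
    """Illustrative sketch: mean-pool frame features into fixed-width
    segments, then assign each segment its nearest codebook index
    (what a trained K-means model does at inference time)."""
    n_seg = len(frames) // width
    segments = frames[: n_seg * width].reshape(n_seg, width, -1).mean(axis=1)
    # nearest-centroid assignment over the discrete-unit vocabulary
    dists = np.linalg.norm(segments[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 8))   # e.g. 100 frames of 8-dim features (toy values)
codebook = rng.normal(size=(16, 8))  # toy vocabulary of 16 discrete units
tokens = tokenize(frames, width=4, codebook=codebook)  # 25 segments -> 25 tokens
```

A larger `width` yields coarser segmentation (fewer, longer tokens), and a larger codebook yields a bigger vocabulary; these are the two axes the paper varies.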
Problem

Research questions and friction points this paper is trying to address.

Investigates impact of segmentation width on speech tokenization
Examines effect of vocabulary cluster size on SLM performance
Optimizes tokenization efficiency for spoken language understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Investigates segmentation width and cluster size effects
Uses K-means models with varied cluster sizes
Combines tokens for fine-grained understanding enhancement
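The multi-scale clustering step listed above amounts to training K-means codebooks at several vocabulary sizes over the same pooled features. As a hedged illustration, the snippet below uses a minimal Lloyd's K-means on toy data; a real setup would use a library implementation and actual speech representations.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Minimal Lloyd's K-means (illustrative stand-in for the
    clustering step; not the paper's implementation)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center, then recompute centers
        d = np.linalg.norm(x[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return centers

rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 8))  # toy pooled segment features
# sweep the vocabulary size, mirroring the paper's multi-scale comparison
codebooks = {k: kmeans(feats, k) for k in (8, 32, 128)}
```

Each resulting codebook defines a different token vocabulary, so the same speech can be re-tokenized at several granularities and compared downstream.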
Shunsuke Kando, The University of Tokyo (Natural Language Processing, Spoken Language Processing)
Yusuke Miyao, Graduate School of Information Science and Technology, The University of Tokyo, Japan
Shinnosuke Takamichi, Keio University (Speech synthesis)