🤖 AI Summary
Existing discrete speech representations struggle to preserve linguistic content and prosodic information simultaneously; both are often entangled with speaker characteristics, which limits their effectiveness in prosody-sensitive tasks. This work proposes a multitask fine-tuning framework that integrates differentiable k-means clustering to jointly optimize automatic speech recognition (ASR) and speech reconstruction objectives. For the first time, it explicitly models prosody within phoneme-level discrete representations, achieving effective disentanglement of linguistic content, prosody, and speaker identity. The resulting phoneme-level tokens significantly outperform existing acoustic and phonetic tokens across multiple downstream tasks, retaining rich phonological and prosodic detail while substantially reducing speaker-identity leakage.
📝 Abstract
In recent years, there has been growing interest in representing speech with discrete tokens, which serve as pseudo-text for speech language models (speechLMs) and as efficient intermediate representations for downstream tasks. These tokens are typically categorized as acoustic or phonetic tokens: the former hold detailed acoustic information for reconstruction, while the latter mainly capture linguistic content. In human speech communication, however, unnecessary acoustic details such as speaker information are abstracted away, while both linguistic and prosodic information are used for speech comprehension and production. Given this, neither type of token seems an ideal representation for prosody-sensitive tasks such as speechLMs. In this study, we propose the Phonological Tokenizer, a method that fine-tunes phonetic tokens via differentiable k-means with a multi-task objective of ASR and speech resynthesis. Experimental validation on diverse tasks confirms that our tokens retain phonological (both linguistic and prosodic) information while appropriately discarding speaker identity.
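The paper does not include its implementation here, but the differentiable k-means it relies on is commonly realized by replacing hard cluster assignment with a temperature-scaled softmax over negative squared distances, so gradients can flow from the ASR and resynthesis losses back through the quantizer. The sketch below illustrates that relaxation only; the function name and temperature parameter are illustrative assumptions, not the authors' code.

```python
import numpy as np

def soft_kmeans_assign(x, centroids, tau=1.0):
    """Differentiable (soft) k-means assignment sketch.

    x: (N, D) frame-level features; centroids: (K, D) codebook.
    Returns soft assignments q (N, K) and quantized features (N, D).
    As tau -> 0, q approaches a hard one-hot assignment to the
    nearest centroid, recovering standard k-means quantization.
    """
    # Squared Euclidean distance from every frame to every centroid, shape (N, K).
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    q = np.exp(logits)
    q /= q.sum(axis=1, keepdims=True)            # row-wise softmax
    # Quantized output is a convex combination of centroids (differentiable).
    x_q = q @ centroids
    return q, x_q
```

In a fine-tuning setup, `x_q` would feed both the ASR and resynthesis branches, and gradients would update the encoder and the centroids jointly; frameworks typically implement the same idea with a straight-through or Gumbel-softmax estimator.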