🤖 AI Summary
Existing discrete speech representations struggle to preserve linguistic content and prosodic information simultaneously; both are often entangled with speaker characteristics, which limits their effectiveness in prosody-sensitive tasks. This work proposes a multitask fine-tuning framework that integrates differentiable k-means clustering to jointly optimize automatic speech recognition (ASR) and speech reconstruction objectives. For the first time, it explicitly models prosody within phoneme-level discrete representations, achieving effective disentanglement of linguistic content, prosody, and speaker identity. The resulting phoneme-level tokens significantly outperform existing acoustic and phonetic tokens across multiple downstream tasks, retaining rich phonological and prosodic detail while substantially reducing speaker-identity leakage.
📝 Abstract
In recent years, there has been growing interest in representing speech with discrete tokens, which serve as pseudo-text for speech language models (speechLMs) and as efficient intermediate representations for downstream tasks. These tokens are typically categorized as acoustic or phonetic tokens: the former hold detailed acoustic information for reconstruction, while the latter mainly capture linguistic content. In human speech communication, however, unnecessary acoustic details such as speaker information are abstracted away, while both linguistic and prosodic information are used for speech comprehension and production. Given this, neither type of token seems an ideal representation for prosody-sensitive tasks such as speechLMs. In this study, we propose the Phonological Tokenizer, a method that fine-tunes phonetic tokens via differentiable k-means with a multi-task objective of ASR and speech resynthesis. Experimental validation on diverse tasks confirms that our tokens retain phonological (both linguistic and prosodic) information while appropriately discarding speaker identity.
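The paper does not include its implementation here, but the differentiable k-means it relies on is commonly realized by replacing hard cluster assignment with a temperature-scaled softmax over negative squared distances, so gradients can flow from the ASR and resynthesis losses back through the quantizer. The sketch below illustrates that relaxation only; the function name and temperature parameter are illustrative assumptions, not the authors' code.

```python
import numpy as np

def soft_kmeans_assign(x, centroids, tau=1.0):
    """Differentiable (soft) k-means assignment sketch.

    x: (N, D) frame-level features; centroids: (K, D) codebook.
    Returns soft assignments q (N, K) and quantized features (N, D).
    As tau -> 0, q approaches a hard one-hot assignment to the
    nearest centroid, recovering standard k-means quantization.
    """
    # Squared Euclidean distance from every frame to every centroid, shape (N, K).
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    q = np.exp(logits)
    q /= q.sum(axis=1, keepdims=True)            # row-wise softmax
    # Quantized output is a convex combination of centroids (differentiable).
    x_q = q @ centroids
    return q, x_q
```

In a fine-tuning setup, `x_q` would feed both the ASR and resynthesis branches, and gradients would update the encoder and the centroids jointly; frameworks typically implement the same idea with a straight-through or Gumbel-softmax estimator.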