🤖 AI Summary
Existing instructable TTS models suffer from a modality gap between coarse-grained text instructions and fine-grained speech tokens, hindering precise prosodic and phonetic control. To address this, we propose HD-PPT—a Hierarchical Dual-Preference Prompting and Tokenization framework. First, we design a novel hierarchical speech codec that disentangles content-related and instruction-related speech tokens. Second, we introduce a dual-preference token extraction mechanism jointly supervised by ASR and CLAP to align textual instructions with hierarchical speech representations. Third, we establish a layered decoding process enabling controllable generation across semantic, prosodic, and phonemic levels. By integrating large language models, speech codec architectures, ASR, CLAP, and hierarchical modeling, HD-PPT achieves state-of-the-art performance in both instruction adherence and speech naturalness, significantly improving the accuracy and expressiveness of controllable TTS.
📝 Abstract
Large Language Model (LLM)-based Text-to-Speech (TTS) models have already reached a high degree of naturalness. However, precise control over TTS inference remains challenging. Although instruction-based Text-to-Speech (Instruct-TTS) models have been proposed, they still lack fine-grained control due to the modality gap between single-level text instructions and multi-level speech tokens. To address this limitation, we propose HD-PPT, a framework that transforms speech synthesis into a structured, hierarchical task. To enable fine-grained control, we introduce a novel speech codec that extracts distinct prompt-preference and content-preference tokens from the complex speech tokens, supervised by automatic speech recognition (ASR) and contrastive language-audio pre-training (CLAP) objectives. To bridge the modality gap between these tokens and the text instructions, we propose a hierarchical decoding strategy in which the LLM generates tokens in a structured order: first semantic tokens, then fine-grained style tokens, and finally the complete acoustic representation. Extensive experiments demonstrate that this hierarchical paradigm significantly improves instruction adherence and achieves state-of-the-art naturalness, validating our approach for precise and controllable speech synthesis. Audio samples are available at https://xxh333.github.io/.
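The layered decoding order described above can be sketched as follows. This is a minimal illustrative sketch only: the three generator functions are hypothetical placeholders standing in for the paper's LLM components, and exist purely to show the structured ordering (semantic tokens first, then fine-grained style tokens, then the complete acoustic representation).

```python
# Illustrative sketch of a hierarchical (layered) decoding order.
# All three generators below are hypothetical placeholders, NOT the
# paper's actual model; they only demonstrate the decoding structure.

def generate_semantic_tokens(text, instruction):
    # Placeholder: a real LLM would autoregressively emit
    # content-preference (semantic) tokens conditioned on the text.
    return [f"sem:{w}" for w in text.split()]

def generate_style_tokens(semantic_tokens, instruction):
    # Placeholder: fine-grained style (prompt-preference) tokens,
    # conditioned on the instruction and the decoded semantic tokens.
    return [f"sty:{instruction}:{i}" for i in range(len(semantic_tokens))]

def generate_acoustic_tokens(semantic_tokens, style_tokens):
    # Placeholder: the complete acoustic representation is decoded
    # last, conditioned on both preceding levels.
    return [f"ac:{s}|{t}" for s, t in zip(semantic_tokens, style_tokens)]

def hierarchical_decode(text, instruction):
    # Structured order: semantic -> style -> acoustic.
    sem = generate_semantic_tokens(text, instruction)
    sty = generate_style_tokens(sem, instruction)
    ac = generate_acoustic_tokens(sem, sty)
    return {"semantic": sem, "style": sty, "acoustic": ac}

tokens = hierarchical_decode("hello world", "whisper softly")
```

The point of the ordering is that each level conditions on everything already decoded, so coarse semantic content is fixed before style, and style before the full acoustic detail.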