🤖 AI Summary
Existing instructable TTS models suffer from a modality gap between coarse-grained text instructions and fine-grained speech tokens, hindering precise prosodic and phonetic control. To address this, we propose HD-PPT—a Hierarchical Dual-Preference Prompting and Tokenization framework. First, we design a novel hierarchical speech codec that disentangles content-related and instruction-related speech tokens. Second, we introduce a dual-preference token extraction mechanism jointly supervised by ASR and CLAP to align textual instructions with hierarchical speech representations. Third, we establish a layered decoding process enabling controllable generation across semantic, prosodic, and phonemic levels. By integrating large language models, speech codec architectures, ASR, CLAP, and hierarchical modeling, HD-PPT achieves state-of-the-art performance in both instruction adherence and speech naturalness, significantly improving the accuracy and expressiveness of controllable TTS.
📝 Abstract
Large Language Model (LLM)-based Text-to-Speech (TTS) models have already reached a high degree of naturalness. However, precise control over TTS inference remains challenging. Although instruction-based Text-to-Speech (Instruct-TTS) models have been proposed, they still lack fine-grained control due to the modality gap between single-level text instructions and multi-level speech tokens. To address this limitation, we propose HD-PPT, a framework that transforms speech synthesis into a structured, hierarchical task. To enable fine-grained control, we introduce a novel speech codec that extracts distinct prompt-preference and content-preference tokens from the complex speech tokens, supervised by automatic speech recognition (ASR) and contrastive language-audio pre-training (CLAP) objectives. To bridge the modality gap between these tokens and the text instructions, we propose a hierarchical decoding strategy in which the LLM generates tokens in a structured order: first semantic tokens, then fine-grained style tokens, and finally the complete acoustic representation. Extensive experiments demonstrate that this hierarchical paradigm significantly improves instruction adherence and achieves state-of-the-art naturalness, validating our approach for precise and controllable speech synthesis. Audio samples are available at https://xxh333.github.io/.
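The layered decoding order described above can be sketched as follows. This is a minimal illustrative sketch only: the three generator functions are hypothetical placeholders standing in for the paper's LLM components, and exist purely to show the structured ordering (semantic tokens first, then fine-grained style tokens, then the complete acoustic representation).

```python
# Illustrative sketch of a hierarchical (layered) decoding order.
# All three generators below are hypothetical placeholders, NOT the
# paper's actual model; they only demonstrate the decoding structure.

def generate_semantic_tokens(text, instruction):
    # Placeholder: a real LLM would autoregressively emit
    # content-preference (semantic) tokens conditioned on the text.
    return [f"sem:{w}" for w in text.split()]

def generate_style_tokens(semantic_tokens, instruction):
    # Placeholder: fine-grained style (prompt-preference) tokens,
    # conditioned on the instruction and the decoded semantic tokens.
    return [f"sty:{instruction}:{i}" for i in range(len(semantic_tokens))]

def generate_acoustic_tokens(semantic_tokens, style_tokens):
    # Placeholder: the complete acoustic representation is decoded
    # last, conditioned on both preceding levels.
    return [f"ac:{s}|{t}" for s, t in zip(semantic_tokens, style_tokens)]

def hierarchical_decode(text, instruction):
    # Structured order: semantic -> style -> acoustic.
    sem = generate_semantic_tokens(text, instruction)
    sty = generate_style_tokens(sem, instruction)
    ac = generate_acoustic_tokens(sem, sty)
    return {"semantic": sem, "style": sty, "acoustic": ac}

tokens = hierarchical_decode("hello world", "whisper softly")
```

The point of the ordering is that each level conditions on everything already decoded, so coarse semantic content is fixed before style, and style before the full acoustic detail.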