No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS

📅 2025-09-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Neural text-to-speech (TTS) systems often rely on transcription-oriented objectives (e.g., CER, NLL) due to the lack of verifiable prosodic rewards, resulting in monotonous, low-naturalness speech; incorporating speaker similarity further destabilizes training and degrades accuracy. Method: We propose an iterative Direct Preference Optimization (DPO) framework grounded in human preferences, requiring only a small set of human-annotated prosodic preference data to directly optimize naturalness—bypassing reward modeling pitfalls and policy collapse common in reinforcement learning. Model regularization ensures training stability. Contribution/Results: Evaluated on the Korean customer-service TTS dataset KoCC-TTS, our method achieves state-of-the-art human preference scores (ELO), competitive CER against mainstream approaches, and substantial gains over GRPO and commercial TTS systems. This work is the first to empirically validate the effectiveness, scalability, and practicality of lightweight human preference optimization for high-fidelity TTS.
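To make the preference objective concrete, below is a minimal sketch of a standard DPO loss applied to (preferred, rejected) prosody pairs; the tensor names, the beta value, and the PyTorch framing are illustrative assumptions, not details taken from the paper.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on a batch of (preferred, rejected) prosody pairs.

    Each *_logps tensor holds the summed log-probability a model assigns to
    the full speech-token sequence of one utterance; the frozen reference
    model supplies the regularization toward the current policy.
    """
    # Log-probability ratios of the trainable policy vs. the frozen reference.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps

    # The preferred rendition should become more likely than the rejected one,
    # relative to the reference model; beta controls the regularization strength.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```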

📝 Abstract
Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for prosody, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into monotone, unnatural speech; adding a speaker-similarity reward further destabilizes training and degrades CER. We address this with an iterative Direct Preference Optimization (DPO) scheme that uses only a few hundred human-labeled preference pairs per round to directly optimize prosodic naturalness while regularizing toward the current model. On KoCC-TTS, a curated dataset of authentic Korean call-center interactions capturing task-oriented dialogues, our method attains the highest human preference (ELO) with competitive CER, outperforming GRPO and strong commercial baselines. These results suggest that when prosody cannot be rewarded automatically, human preference optimization offers a practical and data-efficient path to natural and robust TTS. The demo page is available at https://tts.ch.dev.
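The iterative scheme described in the abstract (a few hundred human-labeled preference pairs per round, regularized toward the current model) could be organized roughly as in the sketch below, reusing the dpo_loss above. The outer-loop structure, the collect_preference_pairs helper, and the sequence_logp interface are hypothetical placeholders, not the authors' implementation.

```python
import copy

def run_iterative_dpo(policy, optimizer, rounds=3, pairs_per_round=300, beta=0.1):
    """Hypothetical outer loop: each round gathers a small set of human
    prosody preference pairs and runs DPO against the frozen current model."""
    for _ in range(rounds):
        # Freeze a copy of the current policy; it serves as the DPO reference
        # that regularizes the update toward the model we already have.
        reference = copy.deepcopy(policy).eval()
        for p in reference.parameters():
            p.requires_grad_(False)

        # Assumed helper: synthesize candidate renditions and ask annotators
        # which sounds more natural (returns text plus chosen/rejected audio tokens).
        pairs = collect_preference_pairs(policy, n=pairs_per_round)

        for text, chosen, rejected in pairs:
            loss = dpo_loss(
                policy.sequence_logp(chosen, text),
                policy.sequence_logp(rejected, text),
                reference.sequence_logp(chosen, text),
                reference.sequence_logp(rejected, text),
                beta=beta,
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```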
Problem

Research questions and friction points this paper is trying to address.

Lack of verifiable reward metrics for prosody optimization in TTS systems
GRPO training collapses prosody into monotone unnatural speech despite low error rates
Need for preference-guided learning to optimize prosodic naturalness with minimal human data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative Direct Preference Optimization for prosody
Human preference pairs guide naturalness optimization
Regularization maintains current model performance
Seungyoun Shin
Channel Corporation, Seoul, South Korea
Dongha Ahn
Channel Corporation, Seoul, South Korea
Jiwoo Kim
Department of Artificial Intelligence, Sungkyunkwan University
Sungwook Jeon
Channel Corporation, Seoul, South Korea