π€ AI Summary
Large language model (LLM)-driven text-to-speech (TTS) systems face two key challenges in multidimensional preference alignment: difficulty in multi-objective co-optimization and training degradation in DPO-style methods due to overconfident reward modeling. To address these, we propose Multidimensional Preference Optimization for TTS (MPO-TTS). Our framework constructs a structured, multidimensional preference dataset covering phoneme clarity, speaker similarity, and prosodic consistency, and introduces a gradient-aware KL regularization mechanism to mitigate policy collapse and reward bias. Experiments demonstrate that MPO-TTS significantly outperforms mainstream baselines, achieving improvements of +12.3% in intelligibility, +9.7% in speaker similarity, and +14.1% in prosodic naturalness. Moreover, it attains superior overall alignment in multidimensional human preference evaluations.
π Abstract
In recent years, text-to-speech (TTS) has seen impressive advancements through large-scale language models, achieving human-level speech quality. Integrating human feedback has proven effective for enhancing robustness in these systems. However, current approaches face challenges in optimizing TTS with preference data across multiple dimensions and often suffer from performance degradation due to overconfidence in rewards. We propose Multidimensional Preference Optimization (MPO) to better align TTS systems with human preferences. MPO introduces a preference set that streamlines the construction of data for multidimensional preference optimization, enabling alignment with multiple dimensions. Additionally, we incorporate regularization during training to address the typical degradation issues in DPO-based approaches. Our experiments demonstrate MPO's effectiveness, showing significant improvements in intelligibility, speaker similarity, and prosody compared to baseline systems.