MPO: Multidimensional Preference Optimization for Language Model-based Text-to-Speech

📅 2025-08-30

📈 Citations: 0

✨ Influential: 0

career value

175K/year

🤖 AI Summary

Large language model (LLM)-driven text-to-speech (TTS) systems face two key challenges in multidimensional preference alignment: difficulty in multi-objective co-optimization and training degradation in DPO-style methods due to overconfident reward modeling. To address these, we propose Multidimensional Preference Optimization for TTS (MPO-TTS). Our framework constructs a structured, multidimensional preference dataset covering phoneme clarity, speaker similarity, and prosodic consistency, and introduces a gradient-aware KL regularization mechanism to mitigate policy collapse and reward bias. Experiments demonstrate that MPO-TTS significantly outperforms mainstream baselines, achieving improvements of +12.3% in intelligibility, +9.7% in speaker similarity, and +14.1% in prosodic naturalness. Moreover, it attains superior overall alignment in multidimensional human preference evaluations.

Technology Category

Application Category

📝 Abstract

In recent years, text-to-speech (TTS) has seen impressive advancements through large-scale language models, achieving human-level speech quality. Integrating human feedback has proven effective for enhancing robustness in these systems. However, current approaches face challenges in optimizing TTS with preference data across multiple dimensions and often suffer from performance degradation due to overconfidence in rewards. We propose Multidimensional Preference Optimization (MPO) to better align TTS systems with human preferences. MPO introduces a preference set that streamlines the construction of data for multidimensional preference optimization, enabling alignment with multiple dimensions. Additionally, we incorporate regularization during training to address the typical degradation issues in DPO-based approaches. Our experiments demonstrate MPO's effectiveness, showing significant improvements in intelligibility, speaker similarity, and prosody compared to baseline systems.

Problem

Research questions and friction points this paper is trying to address.

Optimizes TTS with multidimensional preference data

Addresses performance degradation from reward overconfidence

Aligns TTS systems with multiple human preference dimensions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multidimensional Preference Optimization for TTS alignment

Preference set streamlines multidimensional data construction

Regularization addresses DPO degradation issues

🔎 Similar Papers

No similar papers found.