MPO: Multidimensional Preference Optimization for Language Model-based Text-to-Speech

πŸ“… 2025-08-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Large language model (LLM)-driven text-to-speech (TTS) systems face two key challenges in multidimensional preference alignment: difficulty in multi-objective co-optimization and training degradation in DPO-style methods due to overconfident reward modeling. To address these, we propose Multidimensional Preference Optimization for TTS (MPO-TTS). Our framework constructs a structured, multidimensional preference dataset covering phoneme clarity, speaker similarity, and prosodic consistency, and introduces a gradient-aware KL regularization mechanism to mitigate policy collapse and reward bias. Experiments demonstrate that MPO-TTS significantly outperforms mainstream baselines, achieving improvements of +12.3% in intelligibility, +9.7% in speaker similarity, and +14.1% in prosodic naturalness. Moreover, it attains superior overall alignment in multidimensional human preference evaluations.

Technology Category

Application Category

πŸ“ Abstract
In recent years, text-to-speech (TTS) has seen impressive advancements through large-scale language models, achieving human-level speech quality. Integrating human feedback has proven effective for enhancing robustness in these systems. However, current approaches face challenges in optimizing TTS with preference data across multiple dimensions and often suffer from performance degradation due to overconfidence in rewards. We propose Multidimensional Preference Optimization (MPO) to better align TTS systems with human preferences. MPO introduces a preference set that streamlines the construction of data for multidimensional preference optimization, enabling alignment with multiple dimensions. Additionally, we incorporate regularization during training to address the typical degradation issues in DPO-based approaches. Our experiments demonstrate MPO's effectiveness, showing significant improvements in intelligibility, speaker similarity, and prosody compared to baseline systems.
Problem

Research questions and friction points this paper is trying to address.

Optimizes TTS with multidimensional preference data
Addresses performance degradation from reward overconfidence
Aligns TTS systems with multiple human preference dimensions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multidimensional Preference Optimization for TTS alignment
Preference set streamlines multidimensional data construction
Regularization addresses DPO degradation issues
πŸ”Ž Similar Papers
No similar papers found.
K
Kangxiang Xia
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Xinfa Zhu
Xinfa Zhu
Northwestern Polytechnical University
speech generation
J
Jixun Yao
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
L
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China