Emo-LiPO: Listwise Preference Optimization for Fine-Grained Emotion Intensity Control in LLM-based Text-to-Speech

📅 2026-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language model (LLM)-based text-to-speech (TTS) systems struggle to achieve fine-grained control over emotional intensity, primarily because of the semantic–acoustic gap between text and speech. This work proposes Emo-LiPO, a framework that applies listwise preference optimization (LiPO) to TTS to explicitly align the emotional intensity ordering expressed in textual prompts with that of the synthesized speech. Emo-LiPO learns the global intensity ranking within each emotion category for a fixed transcript, thereby narrowing this gap. To support the approach, the authors construct ESD-plus, a multi-speaker dataset with explicit emotion intensity variations. Experiments on ESD-plus show that Emo-LiPO significantly outperforms both supervised and direct preference optimization (DPO) baselines in emotion accuracy and intensity controllability, with particularly pronounced gains at high intensity levels.
📝 Abstract
Large language model (LLM)-based text-to-speech (TTS) systems enable prompt-conditioned emotional control but struggle with fine-grained emotion intensity due to the semantic–acoustic gap between text and speech. To address this challenge, we formulate emotion intensity control in LLM-based TTS as a learning-to-rank problem and propose Emo-LiPO, a listwise preference optimization framework that aligns prompt-conditioned speech generation with relative emotion intensity expressed in text. Emo-LiPO explicitly models global intensity ordering within each emotion under fixed transcripts, enabling more faithful and continuous emotional expression. We further construct ESD-plus, a multi-speaker dataset with explicit emotion intensity variations, to support fine-grained emotion modeling and evaluation. Experiments on ESD-plus demonstrate that Emo-LiPO significantly improves emotion accuracy and intensity controllability over both supervised- and DPO-based LLM TTS baselines, with particularly pronounced gains at high intensity levels.
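To make the learning-to-rank framing concrete, here is a minimal sketch of a generic listwise ranking objective. The abstract does not give the loss that Emo-LiPO optimizes, so everything here is an assumption for illustration: the Plackett-Luce (ListMLE) likelihood, the DPO-style implicit reward, the function names `implicit_rewards` and `listwise_ranking_loss`, the value of `beta`, and the toy log-probabilities.

```python
import math


def implicit_rewards(policy_logps, ref_logps, beta=0.1):
    """DPO-style implicit reward for each candidate: beta * (log pi - log pi_ref).

    Assumption: the abstract does not say how Emo-LiPO scores candidates.
    """
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]


def listwise_ranking_loss(scores):
    """Plackett-Luce (ListMLE) negative log-likelihood of a target ordering.

    `scores` must be sorted from most preferred to least preferred candidate.
    The loss is small when the scores already decrease along that ordering.
    """
    loss = 0.0
    for i in range(len(scores)):
        tail = scores[i:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_z - scores[i]
    return loss


# Hypothetical sequence log-probabilities for three speech candidates of one
# transcript and one emotion, ordered by how well they match a
# high-intensity prompt (best match first).
policy_logps = [-40.0, -42.0, -45.0]
ref_logps = [-41.0, -41.0, -41.0]

rewards = implicit_rewards(policy_logps, ref_logps, beta=0.1)
loss = listwise_ranking_loss(rewards)
```

Unlike a pairwise objective such as DPO, which compares one preferred and one rejected sample at a time, a listwise loss of this kind scores the whole ordered list at once, which is the property the abstract refers to as modeling global intensity ordering.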
Problem

Research questions and friction points this paper is trying to address.

emotion intensity control
text-to-speech
semantic-acoustic gap
fine-grained emotion
LLM-based TTS
Innovation

Methods, ideas, or system contributions that make the work stand out.

listwise preference optimization
emotion intensity control
LLM-based TTS
learning-to-rank
semantic–acoustic gap