Emo-LiPO: Listwise Preference Optimization for Fine-Grained Emotion Intensity Control in LLM-based Text-to-Speech

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing large language model (LLM)-based text-to-speech (TTS) systems struggle to achieve fine-grained control over emotional intensity, primarily because of the semantic-acoustic gap between text and speech. This work casts intensity control as a learning-to-rank problem and proposes Emo-LiPO, a listwise preference optimization (LiPO) framework that aligns the emotional intensity ordering expressed in textual prompts with that of the synthesized speech. For a fixed transcript, Emo-LiPO models the global intensity ordering within each emotion, which narrows this alignment gap. To support the approach, the authors construct ESD-plus, a multi-speaker dataset with explicit emotion intensity variations. Experiments on ESD-plus show that Emo-LiPO significantly improves emotion accuracy and intensity controllability over both supervised and direct preference optimization (DPO) baselines, with the largest gains at high intensity levels.

📝 Abstract

Large language model (LLM)-based text-to-speech (TTS) systems enable prompt-conditioned emotional control but struggle with fine-grained emotion intensity due to the semantic-acoustic gap between text and speech. To address this challenge, we formulate emotion intensity control in LLM-based TTS as a learning-to-rank problem and propose Emo-LiPO, a listwise preference optimization framework that aligns prompt-conditioned speech generation with relative emotion intensity expressed in text. Emo-LiPO explicitly models global intensity ordering within each emotion under fixed transcripts, enabling more faithful and continuous emotional expression. We further construct ESD-plus, a multi-speaker dataset with explicit emotion intensity variations, to support fine-grained emotion modeling and evaluation. Experiments on ESD-plus demonstrate that Emo-LiPO significantly improves emotion accuracy and intensity controllability over both supervised- and DPO-based LLM TTS baselines, with particularly pronounced gains at high intensity levels.
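To make the learning-to-rank framing concrete, the following is a minimal sketch of one common listwise objective: a Plackett-Luce (ListMLE-style) negative log-likelihood over DPO-style implicit rewards. It is an illustration under stated assumptions, not the paper's exact loss, which this summary does not give. The function names, the `beta` value, and the toy log-probabilities are all hypothetical.

```python
import math


def implicit_rewards(policy_logps, ref_logps, beta=0.1):
    """DPO-style implicit reward per candidate: beta * (log pi_theta - log pi_ref).

    Assumption: each entry is the summed log-probability of one candidate
    speech-token sequence for the same transcript and prompt.
    """
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]


def listwise_pl_loss(rewards_by_rank):
    """Plackett-Luce negative log-likelihood of a ranked list.

    rewards_by_rank must be ordered from most-preferred to least-preferred
    candidate. At each position k, the candidate at rank k should win a
    softmax over itself and every candidate ranked below it.
    """
    loss = 0.0
    for k in range(len(rewards_by_rank) - 1):  # the last term is always zero
        tail = rewards_by_rank[k:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_z - rewards_by_rank[k]
    return loss


# Hypothetical toy case: one transcript, three candidates ranked by how well
# their emotion intensity matches the prompt (best match first).
policy_logps = [-10.0, -12.0, -15.0]
ref_logps = [-11.0, -11.0, -11.0]
rewards = implicit_rewards(policy_logps, ref_logps)

loss_matching_order = listwise_pl_loss(rewards)
loss_reversed_order = listwise_pl_loss(list(reversed(rewards)))
```

With only two candidates this loss reduces to the pairwise form `-log sigmoid(r_preferred - r_rejected)`, which is one way to see how a listwise objective extends a DPO-style pairwise one: the full list supplies the global ordering that independent pairs only approximate.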

Problem

Research questions and friction points this paper is trying to address.

emotion intensity control

text-to-speech

semantic-acoustic gap

fine-grained emotion

LLM-based TTS

Innovation

Methods, ideas, or system contributions that make the work stand out.

listwise preference optimization

emotion intensity control

LLM-based TTS