🤖 AI Summary
Existing large language model (LLM)-based text-to-speech (TTS) systems struggle to achieve fine-grained control over emotional intensity, primarily due to the semantic–acoustic gap between text and speech. This work proposes Emo-LiPO, a framework that introduces listwise preference optimization (LiPO) into TTS for the first time to explicitly align the emotional intensity ordering of textual prompts with that of synthesized speech. For a fixed transcript, Emo-LiPO learns the global intensity ranking within each emotion category, thereby bridging this alignment gap. To support this approach, the authors construct ESD-plus, the first multi-speaker dataset annotated with explicit emotional intensity labels. Experiments on ESD-plus show that Emo-LiPO significantly outperforms both supervised learning and direct preference optimization (DPO) baselines in emotion accuracy and intensity controllability, with particularly pronounced gains at high intensity levels.
📝 Abstract
Large language model (LLM)-based text-to-speech (TTS) systems enable prompt-conditioned emotional control but struggle with fine-grained emotion intensity due to the semantic–acoustic gap between text and speech. To address this challenge, we formulate emotion intensity control in LLM-based TTS as a learning-to-rank problem and propose Emo-LiPO, a listwise preference optimization framework that aligns prompt-conditioned speech generation with relative emotion intensity expressed in text. Emo-LiPO explicitly models global intensity ordering within each emotion under fixed transcripts, enabling more faithful and continuous emotional expression. We further construct ESD-plus, a multi-speaker dataset with explicit emotion intensity variations, to support fine-grained emotion modeling and evaluation. Experiments on ESD-plus demonstrate that Emo-LiPO significantly improves emotion accuracy and intensity controllability over both supervised- and DPO-based LLM TTS baselines, with particularly pronounced gains at high intensity levels.
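To make the learning-to-rank framing concrete, the sketch below contrasts a generic listwise ranking loss with the pairwise logistic loss that DPO-style objectives build on. The abstract does not state the exact Emo-LiPO objective, so this uses the standard Plackett-Luce (ListMLE) negative log-likelihood as a stand-in. The function names and the notion of a per-candidate "score" (for example, a scaled policy-to-reference log-likelihood ratio for each candidate utterance of the same transcript) are assumptions made for illustration, not details from the paper.

```python
import math


def listwise_pl_loss(scores):
    """Plackett-Luce (ListMLE) negative log-likelihood.

    `scores` holds one model score per candidate (hypothetical: one per
    synthesized utterance of the same transcript), listed in the TARGET
    order, i.e. the candidate that should rank first comes first.
    """
    loss = 0.0
    # The last position contributes log(exp(s)) - s = 0, so it is skipped.
    for i in range(len(scores) - 1):
        tail = scores[i:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_z - scores[i]
    return loss


def pairwise_logistic_loss(score_preferred, score_rejected):
    """Pairwise logistic loss, -log(sigmoid(margin)), on a single pair."""
    margin = score_preferred - score_rejected
    return math.log1p(math.exp(-margin))


# Scores consistent with the target order incur a lower loss than reversed ones.
aligned = listwise_pl_loss([3.0, 2.0, 1.0])
reversed_order = listwise_pl_loss([1.0, 2.0, 3.0])
```

With only two candidates the listwise loss reduces to the pairwise one, so the difference appears only with longer lists: a single listwise term constrains the whole ordering at once, whereas a pairwise objective sees one preferred/rejected pair at a time.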