🤖 AI Summary
Current LLM-based TTS systems rely on discrete speech tokens, which limits fine-grained emotional control: emotion is often reduced to coarse categorical labels, and joint modeling of intensity and local prosodic prominence remains unaddressed. This paper proposes EMORL-TTS, the first framework to jointly model global emotional intensity and local emphasis positions within a continuous Valence-Arousal-Dominance (VAD) affective space. It integrates supervised fine-tuning with task-driven reinforcement learning to enable zero-shot, multi-dimensional emotional controllability. Key innovations include: (1) continuous VAD-guided intensity modulation; (2) an emphasis-aware reward function for reinforcement learning; and (3) a lightweight adaptation strategy compatible with mainstream LLM-TTS architectures. Experiments demonstrate significant improvements in emotional accuracy, intensity discrimination, and emphasis clarity, while maintaining naturalness comparable to state-of-the-art methods.
📝 Abstract
Recent LLM-based TTS systems achieve strong synthesis quality and zero-shot ability, but lack fine-grained emotional control due to their reliance on discrete speech tokens. Existing approaches either limit emotions to categorical labels or cannot generalize to LLM-based architectures. We propose EMORL-TTS (Fine-grained Emotion-controllable TTS with Reinforcement Learning), a framework that unifies global intensity control in the VAD space with local emphasis regulation. Our method combines supervised fine-tuning with reinforcement learning guided by task-specific rewards for emotion category, intensity, and emphasis. We further investigate how emphasis placement modulates fine-grained emotion intensity. Experiments show that EMORL-TTS improves emotion accuracy, intensity differentiation, and emphasis clarity, while preserving synthesis quality comparable to strong LLM-based baselines.
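The abstract describes a reinforcement-learning objective built from task-specific rewards for emotion category, intensity, and emphasis. The sketch below illustrates one plausible way such rewards could be combined; the weights, the VAD-distance-based intensity score, and the function names are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of a combined task-driven reward, assuming:
# - a classifier-derived emotion-category score in [0, 1],
# - an intensity score from distance in the continuous VAD
#   (valence-arousal-dominance) space,
# - an emphasis-clarity score in [0, 1].
# Weights and scoring functions are illustrative, not from EMORL-TTS.
from dataclasses import dataclass


@dataclass
class RewardWeights:
    category: float = 1.0
    intensity: float = 0.5
    emphasis: float = 0.5


def intensity_reward(pred_vad, target_vad):
    """Map Euclidean distance in VAD space to a score in (0, 1];
    identical vectors score 1.0, larger distances score lower."""
    dist = sum((p - t) ** 2 for p, t in zip(pred_vad, target_vad)) ** 0.5
    return 1.0 / (1.0 + dist)


def total_reward(category_score, pred_vad, target_vad, emphasis_score,
                 w=RewardWeights()):
    """Weighted sum of the three task-specific rewards named in the
    abstract: emotion category, VAD-based intensity, and emphasis."""
    return (w.category * category_score
            + w.intensity * intensity_reward(pred_vad, target_vad)
            + w.emphasis * emphasis_score)
```

For example, a sample whose predicted VAD vector exactly matches the target gets the full intensity reward: `total_reward(0.9, (0.6, 0.7, 0.5), (0.6, 0.7, 0.5), 0.8)` evaluates to `1.8` under the default weights.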