🤖 AI Summary
This study addresses cross-speaker, training-free control of emotional expressiveness in text-to-speech (TTS) systems built on large language model backbones. Through a systematic elimination study, the authors find that emotional prosody is carried predominantly by the speaker x-vector, which makes it possible to construct emotion-direction vectors in x-vector space and scale them for continuous control of emotional intensity. The results offer initial evidence that centroid-based vector arithmetic on x-vectors supports cross-speaker and cross-lingual emotional control in token-based TTS, contrary to the previously reported incompatibility of such methods with these architectures. Relative to the in-context-learning baseline, emotion2vec cosine similarity improves on average by 0.29 on English and 0.09 on Brazilian Portuguese held-out speakers, while speaker identity is largely preserved (WavLM SECS ≳ 0.88 for the multi-speaker τ variant) and intelligibility remains high (WER ≈ 0 in Brazilian Portuguese).
📝 Abstract
We investigate whether task-vector arithmetic, successful for cross-speaker emotional intensity control in modular text-to-speech (TTS), transfers to large-scale TTS systems built on language-model backbones with in-context learning (LM-TTS). Through a systematic elimination study over four progressively narrower operands on Qwen3-TTS-12Hz-1.7B (model weights via LoRA fine-tuning, continuous codec embeddings, discrete codec tokens, and the speaker embedding, or x-vector, produced by an ECAPA-TDNN encoder jointly trained with the synthesis backbone), we localize the dominant carrier of emotional prosody to the x-vector. Building on this finding, we propose a training-free method based on centroid arithmetic in x-vector space: an emotion direction $\tau = \mathbb{E}_i[x(s_i,\text{emo})] - \mathbb{E}_i[x(s_i,\text{neutral})]$ applied to an unseen target speaker as $x_{\text{new}} = x(\text{target},\text{neutral}) + \alpha\cdot\tau$. Using ESD (English) as the $\tau$ source and emoUERJ (Brazilian Portuguese) as a cross-lingual ground-truth target, we observe average gains of $+0.29$ in emotion2vec cosine similarity over the ICL baseline on English held-out speakers and $+0.09$ on Brazilian Portuguese held-out speakers, while largely preserving identity (WavLM SECS $\gtrsim 0.88$ for the multi-speaker $\tau$ variant) and intelligibility (WER $\approx 0$ in PT-BR). These results offer initial evidence that the reported incompatibility of centroid-arithmetic-style control with token-based TTS architectures may be circumvented when the arithmetic operates on the speaker embedding.
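The centroid arithmetic described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `emotion_direction` and `apply_emotion`, the embedding dimension, and the synthetic vectors are all assumptions chosen for illustration, and in practice the x-vectors would come from the model's speaker encoder rather than a random generator.

```python
import numpy as np


def emotion_direction(emo_xvecs, neutral_xvecs):
    """Emotion direction: centroid of emotional x-vectors minus centroid of neutral ones.

    Both inputs are arrays of shape (num_speakers, dim), one row per source speaker.
    (Hypothetical helper; the name is not taken from the paper.)
    """
    emo = np.asarray(emo_xvecs, dtype=np.float64)
    neutral = np.asarray(neutral_xvecs, dtype=np.float64)
    return emo.mean(axis=0) - neutral.mean(axis=0)


def apply_emotion(target_neutral_xvec, tau, alpha):
    """Shift an unseen speaker's neutral x-vector along tau, scaled by intensity alpha."""
    return np.asarray(target_neutral_xvec, dtype=np.float64) + alpha * tau


# Synthetic stand-ins for real speaker embeddings (dimension chosen arbitrarily).
rng = np.random.default_rng(0)
dim = 8
neutral = rng.normal(size=(5, dim))   # five source speakers, neutral speech
offset = np.full(dim, 0.5)            # assumed shared "emotion" shift for the toy data
emotional = neutral + offset          # same speakers, emotional speech

tau = emotion_direction(emotional, neutral)
target = rng.normal(size=dim)         # held-out speaker, neutral x-vector
x_new = apply_emotion(target, tau, alpha=2.0)
```

Setting `alpha` to 0 returns the target's neutral x-vector unchanged, and larger values move it further along the emotion direction, which is what makes the intensity control continuous.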