🤖 AI Summary
Current large language model–driven text-to-speech systems lack interpretable mechanisms for emotional control. This work uses sparse autoencoders to identify emotion-related sparse latent features in the models' semantic hidden states, then builds a feature-level intervention framework that enables bidirectional emotion induction and suppression without fine-tuning the backbone model. The analysis indicates that emotional expression arises from the coordinated action of multiple sparse latent features, and it links distinct features to specific acoustic attributes such as pitch. Experiments show that the method matches or surpasses global-steering and existing TTS baselines in emotion induction and suppression, supporting the interpretability and effectiveness of sparse latent features for modulating emotional expression.
📝 Abstract
Integrating large language models (LLMs) into text-to-speech (TTS) systems has improved speech expressiveness, yet interpretable emotional control remains challenging. Existing approaches primarily rely on external conditioning or global activation steering, offering limited insight into the internal representations underlying emotional control. In this work, we use sparse autoencoders (SAEs) to analyze emotion-related variation in the semantic hidden states of LLM-based TTS models and to identify the sparse latent features that carry it. Our analysis shows that emotional variation is distributed across multiple sparse latent features, yet intervening on only a small subset of them enables interpretable emotional control. Building on this observation, we introduce a feature-level intervention framework for bidirectional emotion induction and suppression without modifying backbone parameters. We further show that distinct latent features are associated with specific acoustic attributes (e.g., pitch), suggesting that emotional expression arises from coordinated latent contributions rather than a single global shift. Empirically, steering these sparse latent features achieves comparable or superior emotion induction and suppression performance relative to global steering and existing TTS baselines.
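
To make the idea of feature-level intervention concrete, the following is a minimal sketch of one common way to steer a hidden state through SAE features. It is not the paper's implementation: the abstract does not specify the SAE architecture or the intervention rule, so the ReLU encoder, the unit-norm decoder rows, the random weights, the dimensions, the feature indices, and the `encode` and `steer` helpers are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # hypothetical hidden-state and SAE dictionary sizes

# Hypothetical SAE parameters. In practice these would be trained to
# reconstruct the TTS model's semantic hidden states; here they are random.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder rows
b_dec = np.zeros(d_model)


def encode(h):
    """ReLU SAE encoder: hidden states -> non-negative feature activations."""
    return np.maximum((h - b_dec) @ W_enc + b_enc, 0.0)


def steer(h, feature_ids, target):
    """Set the chosen features to `target` and add only the resulting change
    in the decoded output to h. The backbone weights are never modified, and
    the part of h that the SAE does not reconstruct is left untouched."""
    z = encode(h)
    z_new = z.copy()
    z_new[..., feature_ids] = target
    return h + (z_new - z) @ W_dec


h = rng.normal(size=(5, d_model))  # 5 time steps of stand-in hidden states
emotion_features = [3, 17]         # hypothetical emotion-related feature indices

h_induced = steer(h, emotion_features, target=4.0)     # induction: raise features
h_suppressed = steer(h, emotion_features, target=0.0)  # suppression: zero features
```

In this sketch, induction and suppression differ only in the target activation, and the edit to each hidden state is a weighted sum of the selected decoder directions. A global steering baseline would instead add a single fixed vector to every hidden state, regardless of which features are active.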