Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited interpretability of language models in which text and generated speech tokens share a single residual stream, a gap that hinders understanding and control of their generative behavior. The study applies BatchTopK sparse autoencoders to the LM backbone of the TTS model CosyVoice3 and pairs them with a modality-aware automated interpretation pipeline that labels residual-stream features by where they fire: on text, on speech, or on both. The recovered features include phonemes, laughter, accent cues, and speaker gender. Interventions in the SAE latent space steer TTS outputs: laughter probability rises from 0.02 to 0.79, perceived speaker gender can be flipped, and speech rate can be adjusted, all while preserving spoken content. These results indicate that the extracted features are causal rather than merely descriptive.

📝 Abstract

Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires: text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts, and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.
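
To make the two mechanisms named in the abstract concrete, the following is a minimal NumPy sketch of a BatchTopK activation and of one common form of feature steering. It is not the paper's implementation. The class `SparseAutoencoder`, the helper `steer`, the toy dimensions, the random weights, and the choice to steer by adding a scaled unit-norm decoder direction to the residual stream are all assumptions made for illustration; the abstract does not specify how the interventions are applied.

```python
import numpy as np


def batch_topk(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """BatchTopK: keep the k * batch_size largest activations across the
    whole batch (not k per row) and zero out everything else."""
    acts = np.maximum(pre_acts, 0.0)           # ReLU before selection
    flat = acts.ravel()
    n_keep = min(k * acts.shape[0], flat.size)
    idx = np.argpartition(flat, -n_keep)[-n_keep:]  # indices of the n_keep largest
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(acts.shape)


class SparseAutoencoder:
    """Toy SAE with random weights (hypothetical; a real SAE is trained)."""

    def __init__(self, d_model: int, n_features: int, k: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.k = k
        self.W_enc = rng.normal(0.0, 0.1, (d_model, n_features))
        self.b_enc = np.zeros(n_features)
        W_dec = rng.normal(0.0, 0.1, (n_features, d_model))
        # Unit-norm decoder rows, so a steering strength has a fixed scale.
        self.W_dec = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
        self.b_dec = np.zeros(d_model)

    def encode(self, x: np.ndarray) -> np.ndarray:
        return batch_topk((x - self.b_dec) @ self.W_enc + self.b_enc, self.k)

    def decode(self, z: np.ndarray) -> np.ndarray:
        return z @ self.W_dec + self.b_dec


def steer(resid: np.ndarray, sae: SparseAutoencoder, feature: int, alpha: float) -> np.ndarray:
    """Assumed steering rule: shift every residual vector by alpha times
    the decoder direction of the chosen feature."""
    return resid + alpha * sae.W_dec[feature]


# Stand-in for residual-stream activations: 8 token positions, width 16.
sae = SparseAutoencoder(d_model=16, n_features=64, k=4)
resid = np.random.default_rng(1).normal(size=(8, 16))
z = sae.encode(resid)                      # sparse feature activations
steered = steer(resid, sae, feature=3, alpha=5.0)
```

Because the sparsity budget is shared across the batch, some positions may activate more than `k` features and others fewer; only the batch total is capped at `k * batch_size`. In a real setting, `resid` would come from a hook on the TTS language model and `feature` would be the index of an interpreted feature, such as one labeled for laughter.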

Problem

Research questions and friction points this paper is trying to address.

text-to-speech

language models

interpretability

representation

residual stream

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders

Text-to-Speech

Interpretability