Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited interpretability of latent spaces in audio generation models. We propose an acoustic semantic parsing framework based on sparse autoencoders (SAEs), trained on the latent representations of prominent audio autoencoders—including DiffRhythm-VAE, EnCodec, and WavTokenizer. By performing linear regression between SAE hidden units and discretized acoustic attributes (e.g., pitch, loudness, timbre), we establish interpretable mappings. Our approach is the first to enable unified interpretability analysis across both continuous and discrete latent spaces, uncovering dynamic semantic evolution pathways during generation. Experiments demonstrate precise disentanglement and controllable manipulation of key acoustic attributes. In text-to-music models such as DiffRhythm, our method clearly reveals the temporal evolution of pitch, timbre, and loudness, significantly enhancing transparency and controllability in audio synthesis.
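As a rough illustration of the SAE component described above, the following PyTorch sketch shows a minimal sparse autoencoder over frame-level audio latents. This is not the authors' released code; the dimensions are placeholders, and the L1 penalty is one common way to induce the sparsity the summary refers to.

```python
# Minimal SAE sketch (illustrative, not the authors' implementation).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, latent_dim: int, hidden_dim: int):
        super().__init__()
        # Encoder expands the audio latent into an overcomplete feature space;
        # decoder reconstructs the original latent from those features.
        self.encoder = nn.Linear(latent_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, latent_dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, latent_dim) frame-level latents from an audio autoencoder
        h = torch.relu(self.encoder(z))   # sparse hidden activations
        z_hat = self.decoder(h)           # reconstructed latent
        return z_hat, h

def sae_loss(z, z_hat, h, l1_weight: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse,
    # interpretable hidden units (l1_weight is a placeholder value).
    return torch.mean((z - z_hat) ** 2) + l1_weight * torch.mean(torch.abs(h))
```

Tied decoder weights or a TopK activation are common SAE variants; the plain ReLU-plus-L1 version above simply keeps the sketch short.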

📝 Abstract
While sparse autoencoders (SAEs) successfully extract interpretable features from language models, applying them to audio generation faces unique challenges: audio's dense nature requires compression that obscures semantic meaning, and automatic feature characterization remains limited. We propose a framework for interpreting audio generative models by mapping their latent representations to human-interpretable acoustic concepts. We train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties (pitch, amplitude, and timbre). This enables both controllable manipulation and analysis of the AI music generation process, revealing how acoustic properties emerge during synthesis. We validate our approach on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer) audio latent spaces, and analyze DiffRhythm, a state-of-the-art text-to-music model, to demonstrate how pitch, timbre, and loudness evolve throughout generation. While our experiments are limited to the audio modality, the framework can be extended to interpretability analysis of visual latent-space generative models.
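To make the training stage concrete, the snippet below fits the SparseAutoencoder sketch above on precomputed frame-level latents. The file name, batch size, learning rate, and 8x expansion factor are placeholder assumptions; the latents could come from any of the autoencoders named in the abstract (DiffRhythm-VAE bottleneck vectors or codec codebook embeddings).

```python
# Sketch of SAE training on precomputed audio latents (assumptions noted below).
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical file holding frame-level latents of shape (num_frames, latent_dim),
# extracted beforehand from an audio autoencoder.
latents = torch.load("audio_latents.pt")

loader = DataLoader(TensorDataset(latents), batch_size=4096, shuffle=True)
sae = SparseAutoencoder(latent_dim=latents.shape[1],
                        hidden_dim=8 * latents.shape[1])  # expansion factor is a guess
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

for epoch in range(10):
    for (z,) in loader:
        z_hat, h = sae(z)
        loss = sae_loss(z, z_hat, h)
        opt.zero_grad()
        loss.backward()
        opt.step()
```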
Problem

Research questions and friction points this paper is trying to address.

Extracting interpretable features from dense audio latent representations
Mapping AI music generation to human-understandable acoustic properties
Enabling controllable manipulation and analysis of audio synthesis process
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse autoencoders extract features from audio latents
Linear mappings link SAE features to discretized acoustic properties (see the probe sketch after this list)
Framework enables controllable manipulation and generation analysis
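As a concrete, assumed instance of the linear-mapping step above, the probe below maps SAE activations to one discretized acoustic attribute, pitch binned to MIDI semitones. The input arrays, pitch tracker, and binning scheme are illustrative choices, not the paper's exact protocol.

```python
# Sketch of a linear probe from SAE features to a discretized acoustic attribute.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder inputs: per-frame SAE activations and pitch estimates (Hz) from
# any pitch tracker, aligned frame by frame.
features = np.load("sae_features.npy")      # (num_frames, hidden_dim)
pitch_hz = np.load("frame_pitch_hz.npy")    # (num_frames,)

voiced = pitch_hz > 0                        # keep only voiced frames
midi_bins = np.round(69 + 12 * np.log2(pitch_hz[voiced] / 440.0)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    features[voiced], midi_bins, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000)    # a linear map over SAE features
probe.fit(X_train, y_train)
print("semitone-bin accuracy:", probe.score(X_test, y_test))
```

The same recipe applies to other discretized attributes such as loudness bands or timbre classes by swapping the label array.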
Nathan Paek
Stanford University
Yongyi Zang
Smule, Inc.
Computer Audition · Speech Processing · Music Information Retrieval · Music Composition
Qihui Yang
University of California, San Diego
Randal Leistikow
Smule Labs