Pitch-Conditioned Instrument Sound Synthesis From an Interactive Timbre Latent Space

📅 2025-10-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing music synthesis methods rely on high-dimensional latent spaces, resulting in imprecise pitch control and unintuitive timbre manipulation. To address this, we propose a two-stage semi-supervised neural synthesis framework: first, a variational autoencoder learns a disentangled 2D latent representation explicitly encoding pitch and timbre; second, a Transformer conditioned on this low-dimensional, interpretable latent space generates high-fidelity audio. The method preserves music-grade audio quality while significantly improving pitch accuracy and timbre controllability, enabling real-time, intuitive interactive sound editing. Experiments demonstrate fine-grained timbre modulation with sustained high pitch precision. A functional web application validates practical usability. Our core contribution is the first end-to-end synthesis paradigm that jointly achieves precise pitch control, an interpretable timbre latent space, and user-friendly interactivity.

Technology Category

Application Category

📝 Abstract
This paper presents a novel approach to neural instrument sound synthesis using a two-stage semi-supervised learning framework capable of generating pitch-accurate, high-quality music samples from an expressive timbre latent space. Existing approaches that achieve sufficient quality for music production often rely on high-dimensional latent representations that are difficult to navigate and provide unintuitive user experiences. We address this limitation through a two-stage training paradigm: first, we train a pitch-timbre disentangled 2D representation of audio samples using a Variational Autoencoder; second, we use this representation as conditioning input for a Transformer-based generative model. The learned 2D latent space serves as an intuitive interface for navigating and exploring the sound landscape. We demonstrate that the proposed method effectively learns a disentangled timbre space, enabling expressive and controllable audio generation with reliable pitch conditioning. Experimental results show the model's ability to capture subtle variations in timbre while maintaining a high degree of pitch accuracy. The usability of our method is demonstrated in an interactive web application, highlighting its potential as a step towards future music production environments that are both intuitive and creatively empowering: https://pgesam.faresschulz.com
Problem

Research questions and friction points this paper is trying to address.

Synthesizes pitch-accurate instrument sounds from intuitive timbre controls
Addresses unintuitive high-dimensional latent spaces in music generation
Enables expressive controllable audio through disentangled timbre representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage semi-supervised learning framework
Pitch-timbre disentangled 2D latent space
Transformer-based generative model for audio
🔎 Similar Papers
No similar papers found.
C
Christian Limberg
Audio Communication Group, Technische Universität Berlin, Berlin, DE
F
Fares Schulz
Audio Communication Group, Technische Universität Berlin, Berlin, DE
Z
Zhe Zhang
Yamagishi Lab, National Institute of Informatics, Tokyo, JPN
Stefan Weinzierl
Stefan Weinzierl
Professor für Audiokommunikation, TU Berlin
AcousticsMusicologyRoom acousticsAudio engineeringMusical acoustics