TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument

📅 2025-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three core tasks without any fine-tuning: instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation. Methodologically, it introduces a unified, token-based, decoder-only Transformer that jointly conditions on MIDI tokens and CLAP-derived text/audio embeddings, generating audio autoregressively at the token level through a neural audio codec. Crucially, a single model generalizes across tasks in a zero-shot manner: instrument cloning from reference audio alone, text-driven synthesis (e.g., "violin playing jazz"), and fine-grained timbre editing (e.g., "brighter and softer"). Objective evaluations assess the quality of the synthesized audio, the timbral similarity between synthesized and target audio/text, and synthesis accuracy with respect to the input MIDI. The source code, model weights, and audio demos are publicly released.
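The conditioning scheme in the summary can be sketched as a single decoder-only sequence: a timbre condition, then the MIDI tokens, then the autoregressively generated audio-codec tokens. Everything below is an illustrative assumption, not the paper's actual implementation: the special-token IDs, the vocabulary layout, the discretized stand-in for the continuous CLAP embedding, and the random stub that replaces the trained Transformer.

```python
import random

# Assumed special tokens and codec-token ID range (illustrative only).
BOS, TIMBRE, MIDI_END, EOS = 0, 1, 2, 3
AUDIO_VOCAB = range(100, 1124)

def build_prefix(clap_tokens, midi_tokens):
    """Assemble the decoder-only prefix: timbre condition first, then MIDI."""
    return [BOS, TIMBRE, *clap_tokens, MIDI_END, *midi_tokens]

def sample_next_token(sequence, rng):
    """Stand-in for the trained Transformer's next-token distribution."""
    return rng.choice(list(AUDIO_VOCAB))

def generate(clap_tokens, midi_tokens, n_audio_tokens, seed=0):
    """Autoregressively append audio-codec tokens to the conditioning prefix."""
    rng = random.Random(seed)
    seq = build_prefix(clap_tokens, midi_tokens)
    for _ in range(n_audio_tokens):
        seq.append(sample_next_token(seq, rng))
    seq.append(EOS)
    return seq

seq = generate(clap_tokens=[10, 11], midi_tokens=[50, 51, 52], n_audio_tokens=4)
print(seq[:5])  # conditioning head: [0, 1, 10, 11, 2]
```

In the real model the audio tokens would then be decoded back to a waveform by the neural audio codec's decoder; here they are left as integers.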

📝 Abstract
Recent advancements in neural audio codecs have enabled the use of tokenized audio representations in various audio generation tasks, such as text-to-speech, text-to-audio, and text-to-music generation. Leveraging this approach, we propose TokenSynth, a novel neural synthesizer that utilizes a decoder-only transformer to generate desired audio tokens from MIDI tokens and CLAP (Contrastive Language-Audio Pretraining) embedding, which has timbre-related information. Our model is capable of performing instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation without any fine-tuning. This flexibility enables diverse sound design and intuitive timbre control. We evaluated the quality of the synthesized audio, the timbral similarity between synthesized and target audio/text, and synthesis accuracy (i.e., how accurately it follows the input MIDI) using objective measures. TokenSynth demonstrates the potential of leveraging advanced neural audio codecs and transformers to create powerful and versatile neural synthesizers. The source code, model weights, and audio demos are available at: https://github.com/KyungsuKim42/tokensynth
Problem

Research questions and friction points this paper is trying to address.

Flexible neural synthesizer for MIDI-conditioned audio generation
Instrument cloning and text-to-instrument synthesis
Text-guided timbre manipulation without fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-based neural synthesizer
Decoder-only transformer audio generation
CLAP embedding for timbre control
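The CLAP-based timbre control above relies on audio and text sharing one embedding space: timbral similarity can be scored with cosine similarity, and text-guided manipulation can be viewed as steering an audio embedding toward a text embedding. The sketch below uses random vectors as stand-ins for real CLAP embeddings, and the interpolation scheme is an illustrative assumption rather than the paper's method.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity, a simple timbre-similarity score in CLAP space."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def steer_timbre(audio_emb, text_emb, alpha=0.5):
    """Nudge an audio embedding toward a text embedding, then renormalize.
    Illustrative assumption: linear interpolation with mixing weight alpha."""
    mixed = (1 - alpha) * audio_emb + alpha * text_emb
    return mixed / np.linalg.norm(mixed)

rng = np.random.default_rng(0)
audio_emb = rng.normal(size=512)  # stand-in for a CLAP audio embedding
text_emb = rng.normal(size=512)   # stand-in for a CLAP text embedding, e.g. "brighter"

steered = steer_timbre(audio_emb, text_emb)
# Steering moves the embedding closer to the text direction:
print(cosine_sim(steered, text_emb) > cosine_sim(audio_emb, text_emb))  # True
```

In practice the embeddings would come from a pretrained CLAP encoder, and the steered embedding would condition the synthesizer in place of the original one.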