🤖 AI Summary
This work introduces MGE-LDM, the first unified latent diffusion framework to jointly address music generation, source completion, and query-driven source separation, tasks previously handled by separate systems. Methodologically, it reformulates all three tasks as conditional inpainting in a shared latent space, combining multi-condition text guidance, alignment across heterogeneous datasets, and joint modeling of mixtures, submixtures, and isolated sources, which enables fully instrument-agnostic, end-to-end training. Trained jointly on Slakh2100, MUSDB18, and MoisesDB, MGE-LDM supports zero-shot, text-driven separation of arbitrary instruments, controllable mixture generation, and missing-source completion. Experiments demonstrate substantial improvements in audio fidelity and functional consistency over staged baselines. To the authors' knowledge, this is the first framework to realize, within a single model and architecture, instrument-agnostic integration of these three core music signal processing tasks.
📝 Abstract
We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture generation, (2) partial generation (i.e., source imputation), and (3) text-conditioned extraction of arbitrary sources. By formulating both separation and imputation as conditional inpainting tasks in the latent space, our approach supports flexible, class-agnostic manipulation of arbitrary instrument sources. Notably, MGE-LDM can be trained jointly across heterogeneous multi-track datasets (e.g., Slakh2100, MUSDB18, MoisesDB) without relying on predefined instrument categories. Audio samples are available at our project page: https://yoongi43.github.io/MGELDM_Samples/.
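The abstract's key move is treating separation and imputation as conditional inpainting over jointly modeled latents (mixture, submixture, source). The toy sketch below illustrates that idea only at the level of the standard diffusion-inpainting loop: observed tracks are re-noised and clamped at each step while missing tracks are denoised from noise. Everything here is hypothetical scaffolding, not the paper's implementation: `denoise_step` is a placeholder for the learned denoiser, the noise schedule is a crude linear stand-in, and the track layout and `cond` argument are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8   # toy latent dimensionality per track
T = 50  # number of diffusion steps

def denoise_step(x, t, cond):
    """Placeholder for the learned denoiser; a real model would predict
    noise from (x, timestep t, text condition cond)."""
    return x * (1.0 - 1.0 / T)  # crude shrinkage stand-in, not a real model

def inpaint(known, mask, cond, steps=T):
    """Diffusion-inpainting sketch: at each step, re-impose a noised copy of
    the observed tracks (mask == 1) and let the sampler fill in the rest
    (mask == 0). For separation, the mixture is observed and the queried
    source is inpainted; for imputation, the roles are swapped."""
    x = rng.standard_normal(known.shape)  # start from pure noise
    for t in range(steps, 0, -1):
        level = t / steps  # toy linear noise schedule
        noised_known = known * (1 - level) + rng.standard_normal(known.shape) * level
        x = mask * noised_known + (1 - mask) * x  # clamp observed tracks
        x = denoise_step(x, t, cond)
    return mask * known + (1 - mask) * x  # keep observed tracks exact

# "Separation" configuration: mixture observed, submixture and source unknown.
latents = np.stack([rng.standard_normal(D),  # mixture (observed)
                    np.zeros(D),             # submixture (to be inpainted)
                    np.zeros(D)])            # source (to be inpainted)
mask = np.zeros((3, D))
mask[0] = 1.0  # only the mixture track is given at inference

out = inpaint(latents, mask, cond="guitar")
```

The same loop covers all three inference modes in the abstract by changing only the mask: all-zeros for full mixture generation, partial for imputation, mixture-only for text-conditioned extraction.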