🤖 AI Summary
This work introduces MGE-LDM, the first unified latent diffusion framework to jointly address music generation, source completion, and query-driven source separation, tasks previously handled by separate systems. Methodologically, it reformulates all three tasks as conditional inpainting in a shared latent space, combining multi-condition text guidance, alignment across heterogeneous datasets, and joint modeling of mixtures, submixtures, and isolated sources, which enables fully instrument-agnostic, end-to-end training. Trained jointly on Slakh2100, MUSDB18, and MoisesDB, MGE-LDM supports zero-shot, text-driven separation of arbitrary instruments, controllable mixture generation, and missing-source completion. Experiments demonstrate substantial improvements in audio fidelity and functional consistency over staged baselines. To the authors' knowledge, this is the first framework to realize, within a single model and architecture, instrument-agnostic integration of these three core music signal processing tasks.
📝 Abstract
We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture generation, (2) partial generation (i.e., source imputation), and (3) text-conditioned extraction of arbitrary sources. By formulating both separation and imputation as conditional inpainting tasks in the latent space, our approach supports flexible, class-agnostic manipulation of arbitrary instrument sources. Notably, MGE-LDM can be trained jointly across heterogeneous multi-track datasets (e.g., Slakh2100, MUSDB18, MoisesDB) without relying on predefined instrument categories. Audio samples are available at our project page: https://yoongi43.github.io/MGELDM_Samples/.
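The abstract's key move is treating separation and imputation as conditional inpainting over jointly modeled latents (mixture, submixture, source). The toy sketch below illustrates that idea only at the level of the standard diffusion-inpainting loop: observed tracks are re-noised and clamped at each step while missing tracks are denoised from noise. Everything here is hypothetical scaffolding, not the paper's implementation: `denoise_step` is a placeholder for the learned denoiser, the noise schedule is a crude linear stand-in, and the track layout and `cond` argument are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8   # toy latent dimensionality per track
T = 50  # number of diffusion steps

def denoise_step(x, t, cond):
    """Placeholder for the learned denoiser; a real model would predict
    noise from (x, timestep t, text condition cond)."""
    return x * (1.0 - 1.0 / T)  # crude shrinkage stand-in, not a real model

def inpaint(known, mask, cond, steps=T):
    """Diffusion-inpainting sketch: at each step, re-impose a noised copy of
    the observed tracks (mask == 1) and let the sampler fill in the rest
    (mask == 0). For separation, the mixture is observed and the queried
    source is inpainted; for imputation, the roles are swapped."""
    x = rng.standard_normal(known.shape)  # start from pure noise
    for t in range(steps, 0, -1):
        level = t / steps  # toy linear noise schedule
        noised_known = known * (1 - level) + rng.standard_normal(known.shape) * level
        x = mask * noised_known + (1 - mask) * x  # clamp observed tracks
        x = denoise_step(x, t, cond)
    return mask * known + (1 - mask) * x  # keep observed tracks exact

# "Separation" configuration: mixture observed, submixture and source unknown.
latents = np.stack([rng.standard_normal(D),  # mixture (observed)
                    np.zeros(D),             # submixture (to be inpainted)
                    np.zeros(D)])            # source (to be inpainted)
mask = np.zeros((3, D))
mask[0] = 1.0  # only the mixture track is given at inference

out = inpaint(latents, mask, cond="guitar")
```

The same loop covers all three inference modes in the abstract by changing only the mask: all-zeros for full mixture generation, partial for imputation, mixture-only for text-conditioned extraction.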