🤖 AI Summary
This work tackles singing-voice separation from real music recordings with a generative diffusion model trained to produce the solo vocals conditioned on the corresponding mixture, in contrast to conventional systems that mask or transform the mixture's time-frequency representation. The approach improves upon prior generative separation systems and, when trained with supplementary data, achieves objective scores competitive with non-generative baselines. Because sampling is iterative, users can adjust the number of denoising steps to trade separation quality against inference efficiency, and can refine the output when needed. An ablation study of the sampling algorithm quantifies how these user-configurable parameters affect separation fidelity.
📝 Abstract
Separating the individual elements in a musical mixture is an essential process for music analysis and practice. While this is generally addressed using neural networks optimized to mask or transform the time-frequency representation of a mixture to extract the target sources, the flexibility and generalization capabilities of generative diffusion models are giving rise to a novel class of solutions for this complicated task. In this work, we explore singing voice separation from real music recordings using a diffusion model which is trained to generate the solo vocals conditioned on the corresponding mixture. Our approach improves upon prior generative systems and achieves competitive objective scores against non-generative baselines when trained with supplementary data. The iterative nature of diffusion sampling enables the user to control the quality-efficiency trade-off, and also refine the output when needed. We present an ablation study of the sampling algorithm, highlighting the effects of the user-configurable parameters.
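The quality-efficiency trade-off mentioned above comes from the iterative nature of diffusion sampling: each denoising step is one pass through the conditional network, so fewer steps mean faster but coarser separation. A minimal sketch of this idea, assuming a DDPM-style ancestral sampler with a linear beta schedule and a stand-in denoiser (the paper's actual network, schedule, and sampler may differ):

```python
import numpy as np

def sample_conditional(denoise_fn, mixture, num_steps=50, seed=0):
    """Ancestral DDPM-style sampler conditioned on the mixture.

    denoise_fn(x_t, t, mixture) -> predicted noise; a placeholder for the
    trained network. num_steps is the user-configurable knob: fewer steps
    run faster but typically yield a coarser vocal estimate.
    """
    rng = np.random.default_rng(seed)
    # Linear beta schedule (an assumption for illustration only).
    betas = np.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(mixture.shape)  # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        eps = denoise_fn(x, t, mixture)  # noise prediction, conditioned on mixture
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # add noise on all but the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

# Toy denoiser standing in for the trained model: nudges x toward the mixture.
toy_denoiser = lambda x, t, m: x - m
vocals_estimate = sample_conditional(toy_denoiser, mixture=np.zeros(8), num_steps=20)
```

Re-running with a larger `num_steps` spends more network evaluations per output, which is exactly the lever the ablation study varies; the sampler can also be restarted from a partially noised estimate to refine an unsatisfactory output.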