🤖 AI Summary
Existing sound separation methods, particularly those based on time-frequency masking, suffer from residual interference and audio discontinuities when sources overlap. To address this, the authors propose FlowSep, the first framework to bring Rectified Flow Matching (RFM), a generative approach, to language-queried audio source separation (LASS). FlowSep models a linear flow trajectory from noise to target-source features within a VAE latent space, replacing conventional discriminative masking with a generative paradigm. At inference, the RFM-generated latents are decoded into a mel-spectrogram by a pre-trained VAE decoder, and a pre-trained vocoder synthesizes the waveform. Trained on 1,680 hours of audio, FlowSep sets a new state of the art across multiple benchmarks: it yields significant gains in objective metrics, surpasses a diffusion-based LASS model in subjective quality, and accelerates inference by 3.2×.
📝 Abstract
Language-queried audio source separation (LASS) focuses on separating sounds using textual descriptions of the desired sources. Current methods mainly use discriminative approaches, such as time-frequency masking, to separate target sounds and minimize interference from other sources. However, these models face challenges when separating overlapping soundtracks, which may lead to artifacts such as spectral holes or incomplete separation. Rectified flow matching (RFM), a generative model that establishes linear relations between the data and noise distributions, offers superior theoretical properties and simplicity, but has not yet been explored in sound separation. In this work, we introduce FlowSep, a new generative model based on RFM for LASS tasks. FlowSep learns linear flow trajectories from noise to target source features within the variational autoencoder (VAE) latent space. During inference, the RFM-generated latent features are reconstructed into a mel-spectrogram via the pre-trained VAE decoder, followed by a pre-trained vocoder to synthesize the waveform. Trained on 1,680 hours of audio data, FlowSep outperforms state-of-the-art models across multiple benchmarks, as evaluated with subjective and objective metrics. Additionally, our results show that FlowSep surpasses a diffusion-based LASS model in both separation quality and inference efficiency, highlighting its strong potential for audio source separation tasks. Code, pre-trained models, and demos can be found at: https://audio-agi.github.io/FlowSep_demo/ .
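The core RFM idea the abstract describes, learning a linear flow from noise to target features and integrating that flow at inference, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the function names are invented, plain NumPy arrays stand in for VAE latents, no conditioning network is modeled, and the closed-form velocity field used in the sampling check assumes a single toy target point.

```python
import numpy as np

def rfm_training_pair(x1, rng, t=None):
    """Build one rectified-flow training example for a target latent x1.

    The straight-line path is x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I),
    so the ground-truth velocity dx_t/dt = x1 - x0 is constant along the path.
    A velocity network would be trained to regress v_target from (x_t, t, text).
    """
    x0 = rng.standard_normal(x1.shape)   # noise endpoint of the trajectory
    if t is None:
        t = rng.uniform()                # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1        # point on the linear trajectory
    v_target = x1 - x0                   # regression target for the network
    return x_t, v_target, t

def euler_sample(velocity_fn, shape, rng, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 (sample)."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# Toy sanity check: for a single target point x1, the ideal velocity field is
# v(x, t) = (x1 - x) / (1 - t), which transports any noise sample onto x1.
rng = np.random.default_rng(0)
x1 = np.full((4,), 2.0)
sample = euler_sample(lambda x, t: (x1 - x) / (1.0 - t), x1.shape, rng, steps=8)
```

In FlowSep the analytic `velocity_fn` above would be replaced by the learned, text-conditioned velocity network, and `sample` would be a VAE latent handed to the decoder and vocoder; the small, fixed number of Euler steps is what underlies the inference-efficiency advantage over diffusion samplers.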