🤖 AI Summary
This work addresses the scarcity of clean, speaker-isolated audio in real-world recordings by proposing DialogueSidon, a model that jointly performs speech enhancement and speaker separation on monaural two-speaker conversations recorded in the wild. DialogueSidon integrates a variational autoencoder with a diffusion model to efficiently reconstruct full-duplex, speaker-separated audio tracks in the latent space of self-supervised speech representations. Specifically, it uses the VAE to compress self-supervised learning (SSL) features into a compact latent space and a diffusion mechanism to predict the latent variables corresponding to each speaker. Experimental results show that DialogueSidon significantly improves speech intelligibility and separation quality on English, multilingual, and real-world dialogue datasets, while also achieving substantially faster inference.
📝 Abstract
Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but it is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, making it unsuitable for systems that require clean speaker-wise signals. We propose DialogueSidon, a model for joint restoration and separation of degraded monaural two-speaker dialogue audio. DialogueSidon combines a variational autoencoder (VAE) that operates on speech self-supervised learning (SSL) model features, compressing them into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from the degraded mixture. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon substantially improves intelligibility and separation quality over a baseline, while also achieving much faster inference.
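To make the described pipeline concrete, below is a minimal PyTorch sketch of the data flow the abstract implies: a VAE encoder compresses SSL features of the degraded mixture into a compact latent, and a diffusion-style predictor recovers per-speaker latents from it. All module names, dimensions, and the simplified sampling loop are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the DialogueSidon pipeline (assumed shapes and sampler).
import torch
import torch.nn as nn

SSL_DIM, LATENT_DIM, T_STEPS = 768, 64, 50  # assumed feature size, latent size, diffusion steps

class VAEEncoder(nn.Module):
    """Compresses SSL features (B, T, SSL_DIM) into a compact latent space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(SSL_DIM, 2 * LATENT_DIM)  # predicts mean and log-variance

    def forward(self, ssl_feats):
        mu, logvar = self.proj(ssl_feats).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick

class LatentDiffusionPredictor(nn.Module):
    """Denoises the two speakers' latents, conditioned on the mixture latent."""
    def __init__(self):
        super().__init__()
        # input: noisy latents for 2 speakers + mixture-latent condition + timestep
        self.net = nn.Sequential(
            nn.Linear(3 * LATENT_DIM + 1, 256), nn.GELU(),
            nn.Linear(256, 2 * LATENT_DIM),
        )

    def forward(self, noisy_latents, mix_latent, t):
        t_emb = t.float().view(1, 1, 1).expand(*noisy_latents.shape[:2], 1) / T_STEPS
        return self.net(torch.cat([noisy_latents, mix_latent, t_emb], dim=-1))

@torch.no_grad()
def separate(mix_ssl_feats, encoder, predictor):
    """Toy ancestral-sampling loop: degraded-mixture features in, 2 speaker latents out."""
    mix_latent = encoder(mix_ssl_feats)
    x = torch.randn(*mix_latent.shape[:2], 2 * LATENT_DIM)  # start from pure noise
    for t in reversed(range(T_STEPS)):
        x = predictor(x, mix_latent, torch.tensor(t))  # predict clean speaker latents
        if t > 0:  # re-noise for the next step (simplified, assumed schedule)
            x = x + 0.1 * (t / T_STEPS) * torch.randn_like(x)
    return x.chunk(2, dim=-1)  # latents for speaker A and speaker B

ssl_feats = torch.randn(1, 100, SSL_DIM)  # stand-in for SSL features of a degraded mixture
spk_a, spk_b = separate(ssl_feats, VAEEncoder(), LatentDiffusionPredictor())
print(spk_a.shape, spk_b.shape)  # torch.Size([1, 100, 64]) for each speaker
```

In the real system, each recovered latent would be passed through the VAE decoder (and presumably a vocoder) to produce the speaker-wise waveforms; operating in the compact latent space rather than on waveforms is what makes the fast inference reported in the abstract plausible.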