DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio

📅 2026-04-10
🤖 AI Summary
This work addresses the scarcity of clean, speaker-isolated dialogue audio by proposing DialogueSidon, a model that jointly restores and separates degraded monaural two-speaker conversations recorded in the wild. DialogueSidon pairs a variational autoencoder (VAE), which compresses self-supervised learning (SSL) speech features into a compact latent space, with a diffusion-based predictor that recovers a latent representation for each speaker from the degraded mixture, so that each speaker can be reconstructed on a separate track. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon substantially improves intelligibility and separation quality over a baseline, while also running much faster at inference.

📝 Abstract
Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, making it unsuitable for systems requiring clean speaker-wise signals. We propose DialogueSidon, a model for joint restoration and separation of degraded monaural two-speaker dialogue audio. DialogueSidon combines a variational autoencoder (VAE) that operates on speech self-supervised learning (SSL) model features, compressing them into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from the degraded mixture. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon substantially improves intelligibility and separation quality over a baseline, while also achieving much faster inference.
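
The data flow described in the abstract can be illustrated with a small shape-level sketch. It is not the authors' implementation: the weights are random and untrained, all dimensions and names (`SSL_DIM`, `LATENT_DIM`, `NUM_STEPS`, `vae_encode`, `vae_decode`, `predict_speaker_latents`) are hypothetical, and a simple Euler-style refinement loop stands in for the actual diffusion sampler, whose details the abstract does not give. Decoding back to SSL-feature space is likewise an assumption; waveform synthesis is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
SSL_DIM = 32       # size of one SSL feature frame
LATENT_DIM = 8     # size of the compact VAE latent
NUM_SPEAKERS = 2   # two-speaker dialogue
NUM_STEPS = 4      # refinement steps in the toy sampler

# Random, untrained stand-in weights; a real system would learn these.
W_enc = rng.standard_normal((SSL_DIM, 2 * LATENT_DIM)) * 0.1
W_dec = rng.standard_normal((LATENT_DIM, SSL_DIM)) * 0.1
W_pred = rng.standard_normal(
    (SSL_DIM + NUM_SPEAKERS * LATENT_DIM + 1, NUM_SPEAKERS * LATENT_DIM)
) * 0.1


def vae_encode(ssl_feats):
    """Compress SSL features (T, SSL_DIM) into a sampled latent (T, LATENT_DIM)."""
    mean, log_var = np.split(ssl_feats @ W_enc, 2, axis=-1)
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps  # reparameterization trick


def vae_decode(latent):
    """Map latents (..., LATENT_DIM) back to SSL-feature space (..., SSL_DIM)."""
    return latent @ W_dec


def predict_speaker_latents(mixture_feats):
    """Refine noise into per-speaker latents, conditioned on the mixture features.

    Toy stand-in for a diffusion sampler: a fixed number of Euler-style updates.
    """
    num_frames = mixture_feats.shape[0]
    z = rng.standard_normal((num_frames, NUM_SPEAKERS * LATENT_DIM))
    for step in range(NUM_STEPS):
        t = np.full((num_frames, 1), 1.0 - step / NUM_STEPS)  # time conditioning
        update = np.concatenate([mixture_feats, z, t], axis=-1) @ W_pred
        z = z + update / NUM_STEPS
    return z.reshape(num_frames, NUM_SPEAKERS, LATENT_DIM)


# Fake SSL features for 50 frames of a degraded monaural mixture.
mixture_feats = rng.standard_normal((50, SSL_DIM))
speaker_latents = predict_speaker_latents(mixture_feats)  # (50, 2, LATENT_DIM)
speaker_feats = vae_decode(speaker_latents)               # (50, 2, SSL_DIM)

# During training, latents of clean per-speaker features would serve as targets.
clean_latent = vae_encode(rng.standard_normal((50, SSL_DIM)))  # (50, LATENT_DIM)
```

The point of the sketch is the division of labor: the VAE defines a compact space in which per-speaker targets live, and the predictor only has to produce two latents per frame from the mixture rather than operate on full-size features.
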
Problem

Research questions and friction points this paper is trying to address.

full-duplex dialogue
speech separation
monaural mixture
speaker-wise signals
in-the-wild dialogue
Innovation

Methods, ideas, or system contributions that make the work stand out.

full-duplex dialogue separation
variational autoencoder
self-supervised learning features
diffusion-based latent prediction
monaural speech restoration