🤖 AI Summary
This work addresses the scarcity of clean, speaker-isolated audio in real-world recordings by proposing DialogueSidon, a model that jointly performs speech enhancement and speaker separation on monaural two-speaker conversations recorded in the wild. DialogueSidon integrates a variational autoencoder with a diffusion model to efficiently reconstruct full-duplex, speaker-separated audio tracks in the latent space of self-supervised speech representations. Specifically, it uses the VAE to compress self-supervised learning (SSL) features into a compact latent space and a diffusion mechanism to predict the latent variables corresponding to each speaker. Experimental results show that DialogueSidon significantly improves speech intelligibility and separation quality on English, multilingual, and real-world dialogue datasets, while also achieving substantially faster inference.
📝 Abstract
Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but it is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, making it unsuitable for systems that require clean speaker-wise signals. We propose DialogueSidon, a model for joint restoration and separation of degraded monaural two-speaker dialogue audio. DialogueSidon combines a variational autoencoder (VAE) that operates on speech self-supervised learning (SSL) model features, compressing them into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from the degraded mixture. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon substantially improves intelligibility and separation quality over a baseline, while also achieving much faster inference.
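To make the described pipeline concrete, below is a minimal PyTorch sketch of the data flow the abstract implies: a VAE encoder compresses SSL features of the degraded mixture into a compact latent, and a diffusion-style predictor recovers per-speaker latents from it. All module names, dimensions, and the simplified sampling loop are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the DialogueSidon pipeline (assumed shapes and sampler).
import torch
import torch.nn as nn

SSL_DIM, LATENT_DIM, T_STEPS = 768, 64, 50  # assumed feature size, latent size, diffusion steps

class VAEEncoder(nn.Module):
    """Compresses SSL features (B, T, SSL_DIM) into a compact latent space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(SSL_DIM, 2 * LATENT_DIM)  # predicts mean and log-variance

    def forward(self, ssl_feats):
        mu, logvar = self.proj(ssl_feats).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick

class LatentDiffusionPredictor(nn.Module):
    """Denoises the two speakers' latents, conditioned on the mixture latent."""
    def __init__(self):
        super().__init__()
        # input: noisy latents for 2 speakers + mixture-latent condition + timestep
        self.net = nn.Sequential(
            nn.Linear(3 * LATENT_DIM + 1, 256), nn.GELU(),
            nn.Linear(256, 2 * LATENT_DIM),
        )

    def forward(self, noisy_latents, mix_latent, t):
        t_emb = t.float().view(1, 1, 1).expand(*noisy_latents.shape[:2], 1) / T_STEPS
        return self.net(torch.cat([noisy_latents, mix_latent, t_emb], dim=-1))

@torch.no_grad()
def separate(mix_ssl_feats, encoder, predictor):
    """Toy ancestral-sampling loop: degraded-mixture features in, 2 speaker latents out."""
    mix_latent = encoder(mix_ssl_feats)
    x = torch.randn(*mix_latent.shape[:2], 2 * LATENT_DIM)  # start from pure noise
    for t in reversed(range(T_STEPS)):
        x = predictor(x, mix_latent, torch.tensor(t))  # predict clean speaker latents
        if t > 0:  # re-noise for the next step (simplified, assumed schedule)
            x = x + 0.1 * (t / T_STEPS) * torch.randn_like(x)
    return x.chunk(2, dim=-1)  # latents for speaker A and speaker B

ssl_feats = torch.randn(1, 100, SSL_DIM)  # stand-in for SSL features of a degraded mixture
spk_a, spk_b = separate(ssl_feats, VAEEncoder(), LatentDiffusionPredictor())
print(spk_a.shape, spk_b.shape)  # torch.Size([1, 100, 64]) for each speaker
```

In the real system, each recovered latent would be passed through the VAE decoder (and presumably a vocoder) to produce the speaker-wise waveforms; operating in the compact latent space rather than on waveforms is what makes the fast inference reported in the abstract plausible.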