DuplexCascade: Full-Duplex Speech-to-Speech Dialogue with VAD-Free Cascaded ASR-LLM-TTS Pipeline and Micro-Turn Optimization

📅 2026-03-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of current spoken dialogue systems: conventional cascaded ASR–LLM–TTS pipelines rely on voice activity detection (VAD) and support only half-duplex interaction, while existing end-to-end VAD-free approaches struggle to balance conversational intelligence with natural turn-taking. The authors propose a VAD-free, cascaded streaming framework that enables full-duplex interaction by decomposing long utterances into micro-turns and using control tokens to guide the large language model’s (LLM) response timing and turn-taking behavior. This approach preserves the strong linguistic capabilities of text-based LLMs while making the interaction more natural. Evaluated on Full-DuplexBench and VoiceBench, the system achieves state-of-the-art performance among open-source solutions in both full-duplex turn-taking and conversational intelligence.

📝 Abstract

Spoken dialog systems with cascaded ASR-LLM-TTS modules retain strong LLM intelligence, but VAD segmentation often forces half-duplex turns and brittle control. On the other hand, VAD-free end-to-end models support full-duplex interaction but struggle to maintain conversational intelligence. In this paper, we present DuplexCascade, a VAD-free cascaded streaming pipeline for full-duplex speech-to-speech dialogue. Our key idea is to convert conventional utterance-wise long turns into chunk-wise micro-turn interactions, enabling rapid bidirectional exchange while preserving the strengths of a capable text LLM. To reliably coordinate turn-taking and response timing, we introduce a set of conversational special control tokens that steer the LLM's behavior under streaming constraints. On Full-DuplexBench and VoiceBench, DuplexCascade delivers state-of-the-art full-duplex turn-taking and strong conversational intelligence among open-source speech-to-speech dialogue systems.
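
The micro-turn idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the token names `<wait>` and `<respond>`, the `MicroTurnDialogue` class, and the punctuation-based `toy_policy` are all hypothetical stand-ins, since the abstract does not list the actual control tokens or describe how the LLM is prompted. The sketch only shows the general shape of a chunk-wise loop in which each incoming ASR chunk triggers a decision to keep listening or to take the turn, with no VAD segmentation step.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical control tokens (assumption: the paper's real token set is not given here).
WAIT = "<wait>"        # keep listening, produce no speech
RESPOND = "<respond>"  # take the turn and answer


@dataclass
class MicroTurnDialogue:
    # Stand-in for the LLM: maps the accumulated user text to (control token, reply text).
    decide: Callable[[str], Tuple[str, str]]
    context: List[str] = field(default_factory=list)

    def step(self, asr_chunk: str) -> Tuple[str, str]:
        """Process one micro-turn: add the new ASR chunk, then ask the policy what to do."""
        self.context.append(asr_chunk)
        token, reply = self.decide(" ".join(self.context))
        if token == RESPOND:
            self.context.clear()  # the pending user input has been answered
            return RESPOND, reply
        return WAIT, ""


def toy_policy(text: str) -> Tuple[str, str]:
    """Toy rule in place of an LLM: respond once the text ends a sentence."""
    if text.rstrip().endswith(("?", ".", "!")):
        return RESPOND, f"(reply to: {text})"
    return WAIT, ""


def run(chunks: List[str]) -> List[Tuple[str, str]]:
    """Feed ASR chunks one micro-turn at a time and collect the decisions."""
    dialogue = MicroTurnDialogue(decide=toy_policy)
    return [dialogue.step(chunk) for chunk in chunks]


if __name__ == "__main__":
    for token, reply in run(["what is", "the weather", "today?"]):
        print(token, reply)
```

In a real system the `decide` callable would wrap a streaming LLM call and the reply text would be passed to TTS; the toy rule here exists only so the loop can run without any model.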

Problem

Research questions and friction points this paper is trying to address.

full-duplex

speech-to-speech dialogue

VAD-free

conversational intelligence

turn-taking

Innovation

Methods, ideas, or system contributions that make the work stand out.

full-duplex

VAD-free

micro-turn

cascaded ASR-LLM-TTS

conversational control tokens
