AI Summary
This work addresses the limitations of conventional cascaded ASR-LLM-TTS spoken dialogue systems, which rely on voice activity detection (VAD) and support only half-duplex interaction, as well as existing end-to-end VAD-free approaches that struggle to balance conversational intelligence with natural turn-taking. The authors propose a VAD-free, cascaded streaming framework that enables full-duplex interaction by decomposing long utterances into micro-turns and dynamically guiding the large language model's (LLM) response timing and turn-taking behavior through control tokens. This approach preserves the strong linguistic capabilities of text-based LLMs while significantly enhancing interaction naturalness. Evaluated on Full-DuplexBench and VoiceBench, the system achieves state-of-the-art performance among open-source solutions in both full-duplex turn-taking and conversational intelligence.
Abstract
Spoken dialogue systems with cascaded ASR-LLM-TTS modules retain strong LLM intelligence, but VAD segmentation often forces half-duplex turns and brittle control. VAD-free end-to-end models, on the other hand, support full-duplex interaction but struggle to maintain conversational intelligence. In this paper, we present DuplexCascade, a VAD-free cascaded streaming pipeline for full-duplex speech-to-speech dialogue. Our key idea is to convert conventional utterance-wise long turns into chunk-wise micro-turn interactions, enabling rapid bidirectional exchange while preserving the strengths of a capable text LLM. To reliably coordinate turn-taking and response timing, we introduce a set of conversational special control tokens that steer the LLM's behavior under streaming constraints. On Full-DuplexBench and VoiceBench, DuplexCascade delivers state-of-the-art full-duplex turn-taking and strong conversational intelligence among open-source speech-to-speech dialogue systems.
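To make the micro-turn idea concrete, the following is a minimal illustrative sketch and not the DuplexCascade implementation. It assumes a streaming transcript arrives as a list of words, splits it into fixed-size chunks, and asks a policy after each chunk whether to keep listening or to respond. The token names `<wait>` and `<respond>`, the chunk size, and the punctuation-based `stub_policy` (a stand-in for the LLM) are all hypothetical; the abstract does not specify them.

```python
from typing import Iterator, List, Tuple

# Hypothetical control tokens; the abstract does not name the actual ones.
WAIT = "<wait>"        # keep listening, emit no speech
RESPOND = "<respond>"  # take the turn and start answering


def micro_turns(words: List[str], chunk_size: int = 3) -> Iterator[List[str]]:
    """Split a streaming transcript into fixed-size word chunks (micro-turns)."""
    for start in range(0, len(words), chunk_size):
        yield words[start:start + chunk_size]


def stub_policy(context: List[str]) -> str:
    """Stand-in for the LLM: respond once the context ends a sentence."""
    if context and context[-1].endswith((".", "?", "!")):
        return RESPOND
    return WAIT


def run_dialogue(words: List[str], chunk_size: int = 3) -> List[Tuple[str, str]]:
    """Feed chunks one at a time and record the control token chosen after each."""
    context: List[str] = []
    decisions: List[Tuple[str, str]] = []
    for chunk in micro_turns(words, chunk_size):
        context.extend(chunk)  # the policy always sees the full history so far
        decisions.append((" ".join(chunk), stub_policy(context)))
    return decisions


if __name__ == "__main__":
    transcript = "what is the weather like in Paris today?".split()
    for chunk, token in run_dialogue(transcript):
        print(f"{token:10s} {chunk}")
```

In this toy setup the decision is made after every chunk rather than after a VAD-detected end of utterance, which is the property the micro-turn formulation relies on. A real system would replace `stub_policy` with the LLM emitting control tokens.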