Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing turn-taking detection models for full-duplex speech interaction lack robustness: they are either closed-source and parameter-heavy, or support only unimodal (acoustic or linguistic) input; moreover, LLM-based approaches depend on scarce, fully annotated full-duplex data. Method: An open-source, lightweight, modular bimodal turn-taking detection model that jointly leverages acoustic features and linguistic representations to classify four dialogue turn states: "complete," "incomplete," "backchannel," and "wait." Contribution/Results: The authors release the Easy Turn trainset, a 1,145-hour open speech dataset for training turn-taking detection models, together with a curated test set. On the Easy Turn test set, the model outperforms open-source baselines including TEN Turn Detection and Smart Turn V2, achieving state-of-the-art accuracy. This work provides a reproducible, scalable foundation for natural human–machine voice interaction.

📝 Abstract
Full-duplex interaction is crucial for natural human-machine communication, yet remains challenging as it requires robust turn-taking detection to decide when the system should speak, listen, or remain silent. Existing solutions typically rely on dedicated turn-taking models, most of which are not open-sourced; the few available ones are limited by their large parameter size or by supporting only a single modality, such as acoustic or linguistic. Alternatively, some approaches finetune LLM backbones to enable full-duplex capability, but this requires large amounts of full-duplex data, which remain scarce in open-source form. To address these issues, we propose Easy Turn, an open-source, modular turn-taking detection model that integrates acoustic and linguistic bimodal information to predict four dialogue turn states: complete, incomplete, backchannel, and wait, accompanied by the release of the Easy Turn trainset, a 1,145-hour speech dataset designed for training turn-taking detection models. Compared to existing open-source models like TEN Turn Detection and Smart Turn V2, our model achieves state-of-the-art turn-taking detection accuracy on our open-source Easy Turn testset. The data and model will be made publicly available on GitHub.
Problem

Research questions and friction points this paper is trying to address.

Integrating acoustic and linguistic modalities for robust turn-taking detection
Overcoming the limits of existing open-source turn-taking models, which are parameter-heavy or unimodal
Overcoming the scarcity of open full-duplex data for training dialogue systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates acoustic and linguistic bimodal information in a modular model
Predicts four dialogue turn states: complete, incomplete, backchannel, and wait
Releases and trains on the open-source 1,145-hour Easy Turn trainset
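The bullets above describe a classifier that fuses acoustic and linguistic inputs to predict one of four turn states. The paper's actual architecture is not reproduced here; the following is a minimal illustrative sketch of concatenation-based bimodal fusion with a linear head, where all dimensions, weights, and function names are assumptions for illustration only:

```python
import numpy as np

# Hypothetical sketch: fuse an acoustic embedding and a text embedding,
# then score the four turn states used by Easy Turn. Random weights stand
# in for a trained model; this is not the paper's architecture.

STATES = ["complete", "incomplete", "backchannel", "wait"]

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def classify_turn_state(acoustic_emb, text_emb, w, b):
    """Concatenate the two modality embeddings and apply a linear
    classification head over the four dialogue turn states."""
    fused = np.concatenate([acoustic_emb, text_emb])
    probs = softmax(w @ fused + b)
    return STATES[int(np.argmax(probs))], probs

rng = np.random.default_rng(0)
A_DIM, T_DIM = 8, 8                                  # toy embedding sizes
w = rng.standard_normal((len(STATES), A_DIM + T_DIM)) * 0.1
b = np.zeros(len(STATES))

state, probs = classify_turn_state(rng.standard_normal(A_DIM),
                                   rng.standard_normal(T_DIM), w, b)
print(state, probs)
```

A real system would replace the random embeddings with outputs of pretrained acoustic and language encoders and train the fusion head on labeled turn-state data such as the Easy Turn trainset.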
👥 Authors
Guojian Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Chengyou Wang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Hongfei Xue
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Shuiyuan Wang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Dehui Gao
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Zihan Zhang
Huawei Technologies, China
Yuke Lin
Huawei Technologies Co. Ltd
Wenjie Li
Huawei Technologies, China
Longshuai Xiao
Huawei Technologies, China
Zhonghua Fu
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China