Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing turn-taking detection models for full-duplex speech interaction lack robustness: they are either closed-source and parameter-heavy, or support only unimodal (acoustic or linguistic) input; moreover, LLM-based approaches depend on scarce, fully annotated full-duplex data. Method: An open-source, lightweight, modular bimodal turn-taking detection model that jointly leverages acoustic features and linguistic representations to classify four dialogue turn states: "complete," "incomplete," "backchannel," and "wait." Contribution/Results: The authors release the Easy Turn trainset, a 1,145-hour open speech dataset for training turn-taking detection models, together with a curated test set. On the Easy Turn test set, the model outperforms open-source baselines including TEN Turn Detection and Smart Turn V2, achieving state-of-the-art accuracy. This work provides a reproducible, scalable foundation for natural human–machine voice interaction.

📝 Abstract
Full-duplex interaction is crucial for natural human-machine communication, yet remains challenging as it requires robust turn-taking detection to decide when the system should speak, listen, or remain silent. Existing solutions typically rely on dedicated turn-taking models, most of which are not open-sourced; the few available ones are limited by their large parameter size or by supporting only a single modality, such as acoustic or linguistic. Alternatively, some approaches finetune LLM backbones to enable full-duplex capability, but this requires large amounts of full-duplex data, which remain scarce in open-source form. To address these issues, we propose Easy Turn, an open-source, modular turn-taking detection model that integrates acoustic and linguistic bimodal information to predict four dialogue turn states: complete, incomplete, backchannel, and wait, accompanied by the release of the Easy Turn trainset, a 1,145-hour speech dataset designed for training turn-taking detection models. Compared to existing open-source models like TEN Turn Detection and Smart Turn V2, our model achieves state-of-the-art turn-taking detection accuracy on our open-source Easy Turn testset. The data and model will be made publicly available on GitHub.
Problem

Research questions and friction points this paper is trying to address.

Integrating acoustic and linguistic modalities for robust turn-taking detection
Overcoming the limits of existing open-source turn-taking models, which are parameter-heavy or unimodal
Overcoming the scarcity of open full-duplex data for training dialogue systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates acoustic and linguistic bimodal information in a modular model
Predicts four dialogue turn states: complete, incomplete, backchannel, and wait
Releases and trains on the open-source 1,145-hour Easy Turn trainset
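The bullets above describe a classifier that fuses acoustic and linguistic inputs to predict one of four turn states. The paper's actual architecture is not reproduced here; the following is a minimal illustrative sketch of concatenation-based bimodal fusion with a linear head, where all dimensions, weights, and function names are assumptions for illustration only:

```python
import numpy as np

# Hypothetical sketch: fuse an acoustic embedding and a text embedding,
# then score the four turn states used by Easy Turn. Random weights stand
# in for a trained model; this is not the paper's architecture.

STATES = ["complete", "incomplete", "backchannel", "wait"]

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def classify_turn_state(acoustic_emb, text_emb, w, b):
    """Concatenate the two modality embeddings and apply a linear
    classification head over the four dialogue turn states."""
    fused = np.concatenate([acoustic_emb, text_emb])
    probs = softmax(w @ fused + b)
    return STATES[int(np.argmax(probs))], probs

rng = np.random.default_rng(0)
A_DIM, T_DIM = 8, 8                                  # toy embedding sizes
w = rng.standard_normal((len(STATES), A_DIM + T_DIM)) * 0.1
b = np.zeros(len(STATES))

state, probs = classify_turn_state(rng.standard_normal(A_DIM),
                                   rng.standard_normal(T_DIM), w, b)
print(state, probs)
```

A real system would replace the random embeddings with outputs of pretrained acoustic and language encoders and train the fusion head on labeled turn-state data such as the Easy Turn trainset.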
👥 Authors
Guojian Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Chengyou Wang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Hongfei Xue
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Shuiyuan Wang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Dehui Gao
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Zihan Zhang
Huawei Technologies, China
Yuke Lin
Huawei Technologies Co. Ltd
Wenjie Li
Huawei Technologies, China
Longshuai Xiao
Huawei Technologies, China
Zhonghua Fu
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China