A Hierarchical End-of-Turn Model with Primary Speaker Segmentation for Real-Time Conversational AI

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses erroneous turn-taking predictions in multi-speaker scenarios caused by background speech interference. The authors propose a real-time architecture that integrates primary speaker tracking with a hierarchical, causal End-of-Turn (EOT) prediction framework. The system combines primary-speaker voice activity segmentation with multi-scale forecasting of near-future conversational state probabilities, and applies task-oriented knowledge distillation to compress wav2vec 2.0 representations into a lightweight MFCC-based student model, substantially reducing model size and latency while preserving accuracy. Experimental results show a frame-level F1 of 82%, a backchannel detection F1 of 70.6%, and a binary turn-end classification F1 of 69.3%. End-to-end turn detection achieves 87.7% recall with a median latency of only 36 ms and a model size of just 1.14 million parameters, outperforming existing Transformer-based baselines.
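The summary mentions multi-scale forecasting of near-future conversational states. A minimal sketch of how such multi-horizon probabilities could be fused into a single binary end-of-turn decision is shown below; the horizon weights and threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical fusion of multi-horizon "turn is ending" probabilities
# (e.g. for t+10/20/30 ms, as the paper forecasts) into one binary
# end-of-turn decision per frame. Weights and threshold are assumptions.

HORIZON_WEIGHTS = np.array([0.5, 0.3, 0.2])  # nearer horizons weighted higher
EOT_THRESHOLD = 0.6

def end_of_turn(p_horizons: np.ndarray) -> bool:
    """p_horizons: shape (3,), probabilities that the primary speaker's
    turn has ended at t+10, t+20, and t+30 ms respectively."""
    fused = float(HORIZON_WEIGHTS @ p_horizons)  # weighted average
    return fused >= EOT_THRESHOLD

print(end_of_turn(np.array([0.9, 0.8, 0.7])))  # → True
print(end_of_turn(np.array([0.2, 0.3, 0.4])))  # → False
```

Weighting the nearest horizon most heavily is one plausible design choice for keeping detection latency low; the paper's actual decision rule may differ.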

📝 Abstract
We present a real-time front-end for voice-based conversational AI to enable natural turn-taking in two-speaker scenarios by combining primary speaker segmentation with hierarchical End-of-Turn (EOT) detection. To operate robustly in multi-speaker environments, the system continuously identifies and tracks the primary user, ensuring that downstream EOT decisions are not confounded by background conversations. The tracked activity segments are fed to a hierarchical, causal EOT model that predicts the immediate conversational state by independently analyzing per-speaker speech features from both the primary speaker and the bot. Simultaneously, the model anticipates near-future states ($t{+}10/20/30$\,ms) through probabilistic predictions that are aware of the conversation partner's speech. Task-specific knowledge distillation compresses wav2vec~2.0 representations (768\,D) into a compact MFCC-based student (32\,D) for efficient deployment. The system achieves 82\% multi-class frame-level F1 and 70.6\% F1 on Backchannel detection, with 69.3\% F1 on a binary Final vs.\ Others task. On an end-to-end turn-detection benchmark, our model reaches 87.7\% recall vs.\ 58.9\% for Smart Turn~v3 while keeping a median detection latency of 36\,ms versus 800--1300\,ms. Despite using only 1.14\,M parameters, the proposed model matches or exceeds transformer-based baselines while substantially reducing latency and memory footprint, making it suitable for edge deployment.
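The abstract describes distilling 768-D wav2vec 2.0 representations into a 32-D MFCC-based student. One common way to implement such feature-level distillation is to project the student features into the teacher's space and regress them onto the teacher's outputs alongside the task loss; the sketch below assumes this formulation (a linear projection, MSE regression, and a fixed loss weighting), none of which is confirmed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical task-oriented distillation objective: a 32-D MFCC student
# is linearly projected into the 768-D wav2vec 2.0 teacher space and
# regressed onto the teacher's frame representations, combined with the
# turn-state task loss. Projection, MSE, and alpha are assumptions.

def distillation_loss(student_feat, teacher_feat, proj,
                      task_loss, alpha=0.5):
    """student_feat: (T, 32); teacher_feat: (T, 768); proj: (32, 768)."""
    projected = student_feat @ proj                    # (T, 768)
    feat_mse = np.mean((projected - teacher_feat) ** 2)
    return alpha * task_loss + (1.0 - alpha) * feat_mse

T = 4                                                  # toy frame count
student = rng.standard_normal((T, 32))
teacher = rng.standard_normal((T, 768))
proj = np.zeros((32, 768))                             # untrained projection
loss = distillation_loss(student, teacher, proj, task_loss=0.7)
print(loss > 0.0)
```

In practice the projection would be trained jointly with the student so that, at deployment, only the 32-D student runs and the teacher is discarded entirely.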
Problem

Research questions and friction points this paper is trying to address.

turn-taking
speaker segmentation
conversational AI
end-of-turn detection
real-time
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-of-Turn detection
Primary speaker segmentation
Hierarchical modeling
Knowledge distillation
Real-time conversational AI