Endpoint Anticipation for Low-Latency Spoken Dialogue

📅 2026-06-11
🤖 AI Summary
This work addresses the high latency of traditional cascaded spoken dialogue systems, which wait passively for endpoint detection before responding. The authors propose an endpoint anticipation mechanism in which the speech model proactively predicts utterance completion up to 2.56 seconds in advance, triggering speculative execution of the large language model and text-to-speech synthesis on partial context. The work establishes the first speculation pipeline in spoken dialogue grounded in partial contextual input and introduces metrics to quantify the trade-off between latency reduction and computational redundancy. Integrated into the Unmute framework, the method reduces average system latency by 505 milliseconds at the cost of a 28.4% increase in speculative computation, masking sequential bottlenecks so that complex reasoning becomes feasible in real-time speech-to-speech interaction.
📝 Abstract
While low-latency interaction is critical for spoken dialogue, cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, shifting from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based model anticipates endpoints up to 2.56 seconds in advance, enabling speculative execution of LLM and TTS pipelines on partial context. We introduce metrics to quantify the trade-off between realized latency reduction and computational redundancy. Evaluation across conversational and task-oriented datasets shows our model consistently outperforms competitive VAP-based baselines. Integration with the Unmute framework demonstrates a 505 ms average latency reduction with a 28.4% increase in speculative computation, effectively masking sequential bottlenecks to enable complex reasoning in real-time speech-to-speech interaction.
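The trade-off described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's method or its metric definitions. It assumes a hypothetical `Turn` record, a fixed pipeline duration, and a fixed endpoint-confirmation delay, and it treats a speculative run as wasted when the partial context turned out to be unusable. Under those assumptions it compares reactive and speculative response latency and counts discarded runs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    """Hypothetical record of one user turn (names are illustrative only)."""
    endpoint_s: float            # time at which the user actually stops speaking
    trigger_s: Optional[float]   # time speculation was launched, or None if never
    context_matched: bool        # True if the speculative response was usable


def reactive_latency(pipeline_s: float, detect_delay_s: float) -> float:
    # Reactive baseline: wait for endpoint confirmation, then run LLM + TTS.
    return detect_delay_s + pipeline_s


def speculative_latency(turn: Turn, pipeline_s: float, detect_delay_s: float) -> float:
    # If no speculation happened, or its result was unusable, fall back to reactive.
    if turn.trigger_s is None or not turn.context_matched:
        return reactive_latency(pipeline_s, detect_delay_s)
    head_start = turn.endpoint_s - turn.trigger_s
    # Assumption: the response is released only once the endpoint is confirmed,
    # so latency cannot drop below the confirmation delay.
    return max(detect_delay_s, pipeline_s - head_start)


def summarize(turns: List[Turn], pipeline_s: float, detect_delay_s: float):
    """Return (mean latency saved in seconds, discarded speculative runs per turn)."""
    baseline = reactive_latency(pipeline_s, detect_delay_s)
    saved = [baseline - speculative_latency(t, pipeline_s, detect_delay_s) for t in turns]
    wasted = sum(1 for t in turns if t.trigger_s is not None and not t.context_matched)
    return sum(saved) / len(turns), wasted / len(turns)


if __name__ == "__main__":
    demo = [
        Turn(endpoint_s=5.0, trigger_s=4.5, context_matched=True),   # useful speculation
        Turn(endpoint_s=4.0, trigger_s=None, context_matched=False), # no speculation
        Turn(endpoint_s=3.0, trigger_s=2.0, context_matched=False),  # discarded run
    ]
    mean_saved, redundancy = summarize(demo, pipeline_s=1.0, detect_delay_s=0.2)
    print(f"mean latency saved: {mean_saved:.3f} s, redundancy: {redundancy:.3f}")
```

In this toy model, triggering earlier increases the head start and so the potential saving, but it also raises the chance that the partial context is wrong and the run is discarded, which is the tension the reported 505 ms and 28.4% figures reflect.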
Problem

Research questions and friction points this paper is trying to address.

low-latency spoken dialogue
turn-completion detection
endpoint anticipation
speech-to-speech interaction
real-time interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Endpoint Anticipation
low-latency dialogue
speculative execution
turn-taking prediction
speech-to-speech interaction