🤖 AI Summary
This work addresses the latency of traditional cascaded spoken dialogue systems, which wait reactively for endpoint detection before responding and therefore struggle to support real-time interaction. The authors propose an endpoint anticipation mechanism in which a speech-based model predicts utterance completion up to 2.56 seconds in advance, triggering speculative execution of the large language model and text-to-speech synthesis on partial context. The approach is presented as the first speculation pipeline in spoken dialogue grounded in partial contextual input, and it introduces metrics to quantify the trade-off between realized latency reduction and computational redundancy. Across conversational and task-oriented datasets, the model outperforms VAP-based baselines. Integrated into the Unmute framework, the method reduces average latency by 505 milliseconds at the cost of a 28.4% increase in speculative computation, masking sequential bottlenecks and making complex reasoning feasible in real-time speech-to-speech interaction.
📝 Abstract
While low-latency interaction is critical for spoken dialogue, cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, shifting from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based model anticipates endpoints up to 2.56 seconds in advance, enabling speculative execution of LLM and TTS pipelines on partial context. We introduce metrics to quantify the trade-off between realized latency reduction and computational redundancy. Evaluation across conversational and task-oriented datasets shows that our model consistently outperforms competitive VAP-based baselines. Integration with the Unmute framework demonstrates a 505 ms average latency reduction at the cost of a 28.4% increase in speculative computation, effectively masking sequential bottlenecks to enable complex reasoning in real-time speech-to-speech interaction.
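The trade-off between latency reduction and redundant computation can be illustrated with a minimal sketch. The abstract does not give the paper's metric definitions, so everything below is an assumption made for illustration: the `Turn` record, the `summarize` function, and the two quantities it returns (mean realized latency saved per turn, and speculative compute wasted on mispredicted endpoints as a fraction of baseline compute) are hypothetical and should not be read as the authors' formulation.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One user turn in a hypothetical log (all fields are illustrative)."""
    lead_s: float      # how early the endpoint was anticipated, in seconds
    hit: bool          # True if the user really stopped, so the speculation was usable
    spec_cost: float   # compute spent on the speculative LLM + TTS run (arbitrary units)
    base_cost: float   # compute a purely reactive pipeline would spend on this turn


def summarize(turns, pipeline_s):
    """Return (mean latency saved per turn, wasted compute / baseline compute).

    Assumption: a correct speculation hides at most `pipeline_s` seconds,
    the time the LLM + TTS stages would otherwise add after the endpoint.
    A wrong speculation saves nothing and its compute counts as redundant.
    """
    saved = sum(min(t.lead_s, pipeline_s) for t in turns if t.hit)
    wasted = sum(t.spec_cost for t in turns if not t.hit)
    baseline = sum(t.base_cost for t in turns)
    return saved / len(turns), wasted / baseline


if __name__ == "__main__":
    log = [
        Turn(lead_s=0.5, hit=True, spec_cost=1.0, base_cost=1.0),
        Turn(lead_s=1.0, hit=False, spec_cost=1.0, base_cost=1.0),
    ]
    mean_saved, redundancy = summarize(log, pipeline_s=0.75)
    print(f"mean latency saved: {mean_saved:.2f} s, redundancy: {redundancy:.0%}")
```

Under these assumptions, anticipating earlier raises the potential saving per turn but also increases the chance of a wasted speculative run, which is the tension the reported 505 ms and 28.4% figures summarize.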