🤖 AI Summary
To address high endpoint detection latency and frequent truncation (cutoff) errors in multi-turn spoken dialogue systems, this paper proposes a streaming endpoint detection method that combines neural audio codec (NAC) features with label-delayed training. It replaces conventional spectral representations with lightweight, low-bitrate NAC-derived features, introduced here for endpointing for the first time, and trains single-stream and dual-stream endpointers with a label-delayed loss that explicitly targets truncation errors while keeping the detector temporally aligned with both the NAC codec and speech large language models. At a fixed median latency of 160 ms, the single-stream and dual-stream variants achieve relative truncation error reductions of 42.7% and 37.5%, respectively, over baseline methods. When integrated into a codec-based pretrained speech large language model, the method improves median response time by 1200 ms and reduces truncation errors by 35%.
📝 Abstract
Accurate, low-latency endpointing is crucial for effective spoken dialogue systems. While traditional endpointers often rely on spectrum-based audio features, this work proposes real-time speech endpointing for multi-turn dialogues using streaming, low-bitrate Neural Audio Codec (NAC) features, building upon recent advancements in neural audio codecs. To further reduce cutoff errors, we introduce a novel label delay training scheme. At a fixed median latency of 160 ms, our combined NAC and label delay approach achieves significant relative cutoff error reductions: 42.7% for a single-stream endpointer and 37.5% for a two-stream configuration, compared to baseline methods. Finally, we demonstrate efficient integration with a codec-based pretrained speech large language model, improving its median response time by 1200 ms and reducing its cutoff error by 35%.
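The abstract does not spell out how the label delay is applied, so the following is only a minimal sketch of one plausible reading: the per-frame end-of-turn target is shifted later by a fixed number of frames, so the endpointer is trained to fire only after it has seen some audio past the true end of speech. The function name `delay_labels`, the label convention (0 = user still speaking, 1 = turn ended), and the choice of a fixed frame shift are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def delay_labels(labels: Sequence[int], delay_frames: int) -> List[int]:
    """Shift per-frame endpoint labels later in time by `delay_frames`.

    Assumed convention (hypothetical, not taken from the paper):
    0 = user still speaking, 1 = turn has ended.
    The output has the same length as the input; the first
    `delay_frames` positions are filled with 0 ("still speaking").
    """
    if delay_frames < 0:
        raise ValueError("delay_frames must be non-negative")
    n = len(labels)
    # Never pad more frames than the sequence contains.
    pad = [0] * min(delay_frames, n)
    # Drop the trailing frames that are pushed past the end of the sequence.
    return pad + list(labels[: n - len(pad)])


# The true endpoint is at frame 3; with a 2-frame delay the training
# target switches to 1 at frame 5 instead.
original = [0, 0, 0, 1, 1, 1, 1, 1]
delayed = delay_labels(original, delay_frames=2)
```

Under this reading, a larger delay trades a later firing time for fewer premature cutoffs, which is why the reported error reductions are compared at a fixed median latency.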