Streaming Endpointer for Spoken Dialogue using Neural Audio Codecs and Label-Delayed Training

📅 2025-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To reduce both endpointing latency and cutoff (premature truncation) errors in multi-turn spoken dialogue systems, this paper proposes a streaming endpointer built on neural audio codec (NAC) features and label-delayed training. It replaces conventional spectrum-based features with streaming, low-bitrate NAC features and adds a label delay training scheme aimed specifically at cutoff errors. At a fixed median latency of 160 ms, the combined approach reduces cutoff errors relative to baseline methods by 42.7% for a single-stream endpointer and by 37.5% for a two-stream configuration. Integrated with a codec-based pretrained speech large language model, the endpointer improves the model's median response time by 1200 ms and reduces its cutoff error by 35%.
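The two quantities reported here, cutoff (truncation) error and median latency, can be illustrated with a minimal sketch. The definitions below are assumptions made for illustration, since this page does not give the paper's exact metric definitions: a cutoff is counted when the endpointer fires before the true end of the turn, and latency is measured only on turns that were not cut off. The function names `endpoint_metrics` and `relative_reduction` are hypothetical.

```python
from statistics import median


def endpoint_metrics(true_ends_ms, predicted_ends_ms):
    """Return (cutoff_rate, median_latency_ms) for paired endpoints.

    Assumed definitions (not taken from the paper):
    - cutoff: the predicted endpoint is earlier than the true end of turn
    - latency: predicted minus true end, over turns that were not cut off
    """
    if len(true_ends_ms) != len(predicted_ends_ms):
        raise ValueError("endpoint lists must have the same length")
    if not true_ends_ms:
        raise ValueError("at least one turn is required")

    delays = [p - t for t, p in zip(true_ends_ms, predicted_ends_ms)]
    latencies = [d for d in delays if d >= 0]
    cutoff_rate = (len(delays) - len(latencies)) / len(delays)
    median_latency = median(latencies) if latencies else None
    return cutoff_rate, median_latency


def relative_reduction(baseline_rate, new_rate):
    """Relative error reduction, e.g. 0.427 for a 42.7% reduction."""
    return (baseline_rate - new_rate) / baseline_rate


# Toy data, invented for this example only.
true_ends = [1000, 2000, 3000, 4000]
predicted = [1160, 1900, 3200, 4100]
cutoff_rate, median_latency = endpoint_metrics(true_ends, predicted)
```

Comparing systems at a fixed median latency, as the summary does, amounts to tuning each endpointer's decision threshold until `median_latency` matches the target and then comparing `cutoff_rate` values with `relative_reduction`.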

📝 Abstract

Accurate, low-latency endpointing is crucial for effective spoken dialogue systems. While traditional endpointers often rely on spectrum-based audio features, this work proposes real-time speech endpointing for multi-turn dialogues using streaming, low-bitrate Neural Audio Codec (NAC) features, building upon recent advancements in neural audio codecs. To further reduce cutoff errors, we introduce a novel label delay training scheme. At a fixed median latency of 160 ms, our combined NAC and label delay approach achieves significant relative cutoff error reductions: 42.7% for a single-stream endpointer and 37.5% for a two-stream configuration, compared to baseline methods. Finally, we demonstrate efficient integration with a codec-based pretrained speech large language model, improving its median response time by 1200 ms and reducing its cutoff error by 35%.
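One plausible reading of label delay training is sketched below. It is an assumption for illustration, not the authors' implementation: frame-level targets mark end-of-turn with 1 and ongoing speech with 0, and each end-of-turn onset is pushed later by a fixed number of frames so that the model is trained to wait briefly before firing. The function name `delay_endpoint_labels` and the parameter `delay_frames` are hypothetical.

```python
def delay_endpoint_labels(labels, delay_frames):
    """Shift every 0 -> 1 onset later by `delay_frames` frames.

    `labels` is a per-frame list where 1 means "turn has ended" and
    0 means "user is still speaking". A frame keeps its positive label
    only if more than `delay_frames` positive frames precede and include
    it, so short positive runs are suppressed entirely and the return
    to 0 (user resumes speaking) is not delayed.
    """
    if delay_frames < 0:
        raise ValueError("delay_frames must be non-negative")

    delayed = []
    run_length = 0  # consecutive positive frames seen so far
    for label in labels:
        run_length = run_length + 1 if label == 1 else 0
        delayed.append(1 if run_length > delay_frames else 0)
    return delayed


# Toy frame labels, invented for this example only.
original = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
shifted = delay_endpoint_labels(original, delay_frames=2)
```

Under this reading, a larger `delay_frames` trades latency for fewer cutoffs, because pauses shorter than the delay never produce a positive training target.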

Problem

Research questions and friction points this paper is trying to address.

Real-time speech endpointing for multi-turn dialogues

Reducing cutoff errors with label delay training

Integrating with codec-based pretrained speech language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Neural Audio Codec features

Introduces label delay training

Integrates with speech language model
