Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of achieving both low time-to-first-token (TTFT) latency and high accuracy in delay-sensitive, edge-based speech recognition, where conventional full-attention Transformers suffer from excessive latency and computational cost due to their global dependencies. To this end, the authors propose an ergodic streaming encoder based on sliding-window self-attention, which combines streaming processing with localized attention. This approach enables bounded, low-latency inference while preserving strong local contextual modeling. Experimental results demonstrate that the lightweight model achieves state-of-the-art word error rates on standard benchmarks, significantly outperforms comparable systems in inference speed, and matches the accuracy of substantially larger models with roughly one-sixth of their parameters, offering an efficient solution for real-time voice interaction on edge devices.

📝 Abstract
Latency-critical speech applications (e.g., live transcription, voice commands, and real-time translation) demand low time-to-first-token (TTFT) and high transcription accuracy, particularly on resource-constrained edge devices. Full-attention Transformer encoders remain a strong accuracy baseline for automatic speech recognition (ASR) because every frame can directly attend to every other frame, which resolves otherwise locally ambiguous acoustics using distant lexical context. However, this global dependency incurs quadratic complexity in sequence length, inducing an inherent "encode-the-whole-utterance" latency profile. For streaming use cases, this causes TTFT to grow linearly with utterance length, as the encoder must process the entire prefix before any decoder token can be emitted. To better meet the needs of on-device, streaming ASR use cases, we introduce Moonshine v2, an ergodic streaming-encoder ASR model that employs sliding-window self-attention to achieve bounded, low-latency inference while preserving strong local context. Our models achieve state-of-the-art word error rates across standard benchmarks, attaining accuracy on par with models 6x their size while running significantly faster. These results demonstrate that carefully designed local attention is competitive with the accuracy of full attention at a fraction of the size and latency cost, opening new possibilities for interactive speech interfaces on edge devices.
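The core mechanism the abstract describes — replacing full attention (every frame attends to every other frame, quadratic cost) with sliding-window self-attention (each frame attends only to a bounded local neighborhood, linear cost) — can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the window size, causal masking choice, and single-head NumPy formulation are all illustrative assumptions.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where frame i may attend only to frames j with
    i - window < j <= i (a causal local window). Per-frame cost is
    O(window) rather than O(seq_len), so latency stays bounded as
    the utterance grows."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def sliding_window_attention(q: np.ndarray, k: np.ndarray,
                             v: np.ndarray, window: int) -> np.ndarray:
    """Single-head scaled dot-product attention with frames outside
    the local window masked out before the softmax."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    mask = sliding_window_mask(q.shape[0], window)
    scores = np.where(mask, scores, -np.inf)   # exclude distant frames
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Under full attention the mask would be all-ones, forcing the encoder to see the whole utterance before emitting anything; with the windowed mask, each new frame only needs the last `window` frames, which is what makes streaming with bounded TTFT possible.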
Problem

Research questions and friction points this paper is trying to address.

latency-critical
automatic speech recognition
streaming ASR
time-to-first-token
edge devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

sliding-window self-attention
streaming ASR
low-latency inference
ergodic encoder
on-device speech recognition