Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization

πŸ“… 2026-01-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work proposes a streaming speech recognition method based on decoder-only large language models (LLMs) that achieves significantly reduced latency while maintaining high recognition accuracy. The approach introduces a read/write policy network combined with Monotonic Chunkwise Attention (MoChA) to dynamically segment speech embeddings. During training, speech segments and label sequences are interleaved as input to the LLM; at inference time, audio is buffered until MoChA emits a trigger signal, at which point the buffered chunk is passed to the LLM for token generation. The integration of MoChA with the policy network enables a novel minimum-latency training objective, and a joint parameter training strategy is employed for both streaming and non-streaming configurations. Evaluated on AISHELL-1 and AISHELL-2, the method achieves character error rates of 5.1% and 5.5%, respectively, with an average token generation latency reduction of 62.5% and negligible degradation in recognition performance.

πŸ“ Abstract
Recent advances have demonstrated the potential of decoder-only large language models (LLMs) for automatic speech recognition (ASR). However, enabling streaming recognition within this framework remains a challenge. In this work, we propose a novel streaming ASR approach that integrates a read/write policy network with monotonic chunkwise attention (MoChA) to dynamically segment speech embeddings. These segments are interleaved with label sequences during training, enabling seamless integration with the LLM. During inference, the audio stream is buffered until the MoChA module triggers a read signal, at which point the buffered segment together with the previous token is fed into the LLM for the next token prediction. We also introduce a minimal-latency training objective to guide the policy network toward accurate segmentation boundaries. Furthermore, we adopt a joint training strategy in which a non-streaming LLM-ASR model and our streaming model share parameters. Experiments on the AISHELL-1 and AISHELL-2 Mandarin benchmarks demonstrate that our method consistently outperforms recent streaming ASR baselines, achieving character error rates of 5.1% and 5.5%, respectively. The latency optimization results in a 62.5% reduction in average token generation delay with negligible impact on recognition accuracy.
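The inference loop described in the abstract (buffer audio until MoChA fires a read signal, then feed the buffered segment plus the previous token to the LLM) can be sketched as follows. This is a toy illustration under loose assumptions, not the authors' implementation: `mocha_policy` stands in for the learned MoChA read/write policy (here it simply fires every fixed number of frames) and `llm_step` is a hypothetical stand-in for the decoder-only LLM.

```python
def mocha_policy(frame_index, chunk_size=4):
    """Toy read/write policy: fire a trigger every chunk_size frames.
    The real model learns these boundaries via monotonic chunkwise attention."""
    return (frame_index + 1) % chunk_size == 0

def llm_step(segment, prev_token):
    """Stand-in for the decoder-only LLM: returns a dummy token recording
    how many frames it conditioned on."""
    return f"tok({prev_token}+{len(segment)}f)"

def streaming_decode(frames, chunk_size=4):
    buffer, tokens, prev = [], [], "<s>"
    for i, frame in enumerate(frames):
        buffer.append(frame)               # buffer audio until the policy fires
        if mocha_policy(i, chunk_size):    # MoChA trigger: time to decode
            prev = llm_step(buffer, prev)  # segment + previous token -> LLM
            tokens.append(prev)
            buffer = []                    # start the next segment
    if buffer:                             # flush any partial tail segment
        prev = llm_step(buffer, prev)
        tokens.append(prev)
    return tokens

print(streaming_decode(list(range(10)), chunk_size=4))
```

The point of the sketch is the control flow: token generation is driven by policy triggers rather than by a fixed decoder schedule, which is what allows latency to be traded off via the segmentation boundaries the paper's minimal-latency objective optimizes.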
Problem

Research questions and friction points this paper is trying to address.

streaming speech recognition
decoder-only LLM
latency optimization
monotonic chunkwise attention
automatic speech recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

decoder-only LLM
streaming ASR
monotonic chunkwise attention
latency optimization
joint training
πŸ”Ž Similar Papers
No similar papers found.
Genshun Wan
University of Science and Technology of China, Hefei, P.R.China
Wenhui Zhang
Researcher/Software Engineer
Infrastructure and System
Jingxuan Zhang
School of Artificial Intelligence and Computer Science, Shaanxi Normal University, Xi’an, P.R. China
Shifu Xiong
University of Science and Technology of China, Hefei, P.R.China
Jianqing Gao
iFLYTEK Research, iFLYTEK Co., Ltd., Hefei, P.R.China
Zhongfu Ye
University of Science and Technology of China, Hefei, P.R.China