🤖 AI Summary
This work proposes Uni-ASR, a unified end-to-end automatic speech recognition (ASR) framework built on large language models (LLMs) that supports both non-streaming and streaming recognition in a single model, without architectural modifications. Existing LLM-based ASR systems often require separate designs for each mode and are difficult to deploy in low-latency streaming scenarios. Uni-ASR addresses this limitation with joint training across the two modes and adds two components: context-aware training and a co-designed fallback decoding strategy, which together improve streaming recognition accuracy without introducing additional latency. Experiments show that Uni-ASR achieves competitive performance in non-streaming mode and remains effective in streaming mode under diverse latency constraints.
📝 Abstract
Although the deep integration of Automatic Speech Recognition (ASR) systems with Large Language Models (LLMs) has significantly improved accuracy, deploying such systems in low-latency streaming scenarios remains challenging. In this paper, we propose Uni-ASR, a unified LLM-based framework that integrates both non-streaming and streaming speech recognition capabilities. A joint training paradigm enables the system to transition seamlessly between the two recognition modes without any architectural modifications. Furthermore, we introduce a context-aware training paradigm and a co-designed fallback decoding strategy, which enhance streaming recognition accuracy without introducing additional latency. Experimental results show that Uni-ASR not only achieves competitive performance in non-streaming mode, but also remains effective in streaming scenarios under diverse latency constraints.