Uni-ASR: Unified LLM-Based Architecture for Non-Streaming and Streaming Automatic Speech Recognition

📅 2026-03-11
🤖 AI Summary
This work proposes Uni-ASR, a unified large language model (LLM)-based automatic speech recognition (ASR) framework that supports both non-streaming and streaming modes within a single model, without architectural modifications. Although integrating ASR with LLMs has improved accuracy, such systems remain difficult to deploy in low-latency streaming scenarios. Uni-ASR addresses this through joint training of the two modes and adds two further components: context-aware training and a co-designed fallback decoding strategy, which together improve streaming recognition accuracy without adding latency. Experiments show that Uni-ASR achieves competitive performance in non-streaming mode and is effective in streaming scenarios under diverse latency constraints.

📝 Abstract
Although the deep integration of Automatic Speech Recognition (ASR) systems with Large Language Models (LLMs) has significantly improved accuracy, deploying such systems in low-latency streaming scenarios remains challenging. In this paper, we propose Uni-ASR, a unified LLM-based framework that integrates both non-streaming and streaming speech recognition capabilities. A joint training paradigm enables the system to switch seamlessly between the two recognition modes without any architectural modifications. Furthermore, we introduce a context-aware training paradigm and a co-designed fallback decoding strategy, which enhance streaming recognition accuracy without introducing additional latency. Experimental results demonstrate that Uni-ASR not only achieves competitive performance in non-streaming mode but also performs strongly in streaming scenarios under diverse latency constraints.
Problem

Research questions and friction points this paper is trying to address.

Automatic Speech Recognition
Large Language Models
Streaming ASR
Non-Streaming ASR
Low-Latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified ASR
Large Language Models
Streaming Speech Recognition
Context-Aware Training
Fallback Decoding

Authors

Yinfeng Xia
Qwen Applications Business Group, Alibaba, China
Jian Tang
Tongyi AI Lab, Alibaba, China
Junfeng Hou
University of Science and Technology of China
Gaopeng Xu
Qwen Applications Business Group, Alibaba, China
Haitao Yao
Qwen Applications Business Group, Alibaba, China