Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR

📅 2024-10-03

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the trade-off between accuracy and inference speed in automatic speech recognition (ASR). It proposes HAINAN, an extension of the Token-and-Duration Transducer (TDT) in which a single model supports autoregressive (AR), non-autoregressive (NAR), and semi-autoregressive (SAR) decoding. The model is trained with randomly masked predictor network outputs, so it can run with all components (AR) or without the predictor (NAR). The SAR paradigm is the core innovation: an NAR draft is generated first, and each token is then regenerated by parallelized autoregression on that draft. Experiments show that NAR mode matches CTC's efficiency while significantly improving accuracy; AR mode matches TDT's efficiency and surpasses both TDT and RNN-T in accuracy; and SAR mode improves accuracy further with minimal computational overhead, in some cases outperforming TDT.

📝 Abstract

We present Hybrid-Autoregressive INference TrANsducers (HAINAN), a novel architecture for speech recognition that extends the Token-and-Duration Transducer (TDT) model. Trained with randomly masked predictor network outputs, HAINAN supports both autoregressive inference with all network components and non-autoregressive inference without the predictor. Additionally, we propose a novel semi-autoregressive inference paradigm that first generates an initial hypothesis using non-autoregressive inference, followed by refinement steps where each token prediction is regenerated using parallelized autoregression on the initial hypothesis. Experiments on multiple datasets across different languages demonstrate that HAINAN achieves efficiency parity with CTC in non-autoregressive mode and with TDT in autoregressive mode. In terms of accuracy, autoregressive HAINAN outperforms TDT and RNN-T, while non-autoregressive HAINAN significantly outperforms CTC. Semi-autoregressive inference further enhances the model's accuracy with minimal computational overhead, and even outperforms TDT results in some cases. These results highlight HAINAN's flexibility in balancing accuracy and speed, positioning it as a strong candidate for real-world speech recognition applications.
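The semi-autoregressive paradigm described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_joint`, the frame labels, and the lookup tables are hypothetical stand-ins, and the sketch assumes one token per frame with no blank symbol and no duration prediction. It only shows the control flow: a draft is produced with the predictor masked, then every position is re-predicted at once using the previous token from the draft as context.

```python
from typing import Callable, List, Optional, Sequence

BOS = "<s>"  # hypothetical start-of-sequence context

# Signature of the toy joint: (frame, previous token or None) -> token.
Joint = Callable[[str, Optional[str]], str]


def toy_joint(frame: str, prev: Optional[str]) -> str:
    """Hypothetical stand-in for the encoder, predictor, and joint network."""
    # Guess from acoustics alone (used when the predictor is masked).
    acoustic_guess = {"f0": "I", "f1": "sea", "f2": "it"}
    # Corrections that become available once text context is known.
    with_context = {("f1", "I"): "see"}
    if prev is None:
        return acoustic_guess[frame]
    return with_context.get((frame, prev), acoustic_guess[frame])


def nar_decode(frames: Sequence[str], joint: Joint) -> List[str]:
    """Non-autoregressive pass: no predictor context, frames are independent."""
    return [joint(f, None) for f in frames]


def sar_refine(frames: Sequence[str], hypothesis: Sequence[str],
               joint: Joint) -> List[str]:
    """One refinement step: re-predict every token using the draft as history.

    Each position depends only on the draft, not on other refined tokens,
    so all positions could be computed in parallel.
    """
    prev_tokens = [BOS] + list(hypothesis[:-1])
    return [joint(f, p) for f, p in zip(frames, prev_tokens)]


frames = ["f0", "f1", "f2"]
draft = nar_decode(frames, toy_joint)
refined = sar_refine(frames, draft, toy_joint)
print(draft)    # ['I', 'sea', 'it']
print(refined)  # ['I', 'see', 'it']
```

In this toy case the draft contains an acoustically plausible error ("sea") that the refinement step corrects once the preceding draft token is available as context.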

Problem

Research questions and friction points this paper is trying to address.

Improves speech recognition accuracy and speed

Introduces semi-autoregressive inference paradigm

Balances efficiency and effectiveness in ASR

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid-Autoregressive INference TrANsducers

Semi-autoregressive inference paradigm

Randomly masked predictor network outputs
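As a rough illustration of training with randomly masked predictor outputs, the sketch below zeroes the predictor vector with some probability before it is combined with the encoder vector. Several details are assumptions rather than facts from the paper: that masking means zeroing, that the joint combines its inputs by addition, and the function names `mask_predictor_output` and `joint_input`.

```python
import random
from typing import List


def mask_predictor_output(pred_out: List[float], mask_prob: float,
                          rng: random.Random) -> List[float]:
    """With probability mask_prob, replace the predictor output with zeros.

    Assumption: masking is implemented as zeroing the whole vector.
    """
    if rng.random() < mask_prob:
        return [0.0] * len(pred_out)
    return list(pred_out)


def joint_input(enc_out: List[float], pred_out: List[float]) -> List[float]:
    """Assumed additive combination of encoder and predictor outputs."""
    return [e + p for e, p in zip(enc_out, pred_out)]


rng = random.Random(0)
enc = [0.5, -1.0, 2.0]
pred = [0.1, 0.2, 0.3]

# Always masked: the joint sees only the encoder, as in NAR inference.
always_masked = joint_input(enc, mask_predictor_output(pred, 1.0, rng))
# Never masked: the joint sees both inputs, as in AR inference.
never_masked = joint_input(enc, mask_predictor_output(pred, 0.0, rng))
```

Under these assumptions, a model that sees both cases during training can be run with or without the predictor at inference time, which is the property the abstract attributes to HAINAN.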
