🤖 AI Summary
This paper addresses the longstanding trade-off between accuracy and inference speed in automatic speech recognition (ASR). It proposes HAINAN, a unified architecture in which a single model supports autoregressive (AR), non-autoregressive (NAR), and semi-autoregressive (SAR) decoding, with strong accuracy-latency trade-offs in all three modes. Its core innovation is the SAR paradigm: an NAR draft is first generated in parallel, and each token is then re-predicted by AR refinement that runs in parallel over the draft. Key technical contributions include an extension of the Token-and-Duration Transducer (TDT), training with randomly masked predictor outputs, and a design that supports both CTC-style parallel decoding and RNN-T-style sequential decoding. Experiments show that the NAR mode matches CTC's efficiency while significantly improving accuracy; the AR mode matches TDT's efficiency and surpasses both TDT and RNN-T in accuracy; and the SAR mode adds further accuracy gains over NAR at minimal computational overhead, outperforming TDT in some cases.
📝 Abstract
We present Hybrid-Autoregressive INference TrANsducers (HAINAN), a novel architecture for speech recognition that extends the Token-and-Duration Transducer (TDT) model. Trained with randomly masked predictor network outputs, HAINAN supports both autoregressive inference with all network components and non-autoregressive inference without the predictor. Additionally, we propose a novel semi-autoregressive inference paradigm that first generates an initial hypothesis using non-autoregressive inference, followed by refinement steps where each token prediction is regenerated using parallelized autoregression on the initial hypothesis. Experiments on multiple datasets across different languages demonstrate that HAINAN achieves efficiency parity with CTC in non-autoregressive mode and with TDT in autoregressive mode. In terms of accuracy, autoregressive HAINAN outperforms TDT and RNN-T, while non-autoregressive HAINAN significantly outperforms CTC. Semi-autoregressive inference further enhances the model's accuracy with minimal computational overhead, and even outperforms TDT results in some cases. These results highlight HAINAN's flexibility in balancing accuracy and speed, positioning it as a strong candidate for real-world speech recognition applications.
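The three inference modes can be illustrated with a toy sketch. This is not the authors' implementation and makes several simplifying assumptions: encoder frames are already score vectors over a four-symbol vocabulary, the predictor is a hypothetical lookup table `PRED_BIAS` keyed on the previous token, the joiner is a plain element-wise sum, TDT duration prediction is omitted, at most one token is emitted per frame, and refinement is restricted to non-blank tokens. All function names are made up for illustration.

```python
import random

BLANK = 0  # toy vocabulary ids: 0=<blank>, 1="a", 2="b", 3="c"

# Hypothetical predictor: a score bias over the vocabulary given the previous token.
PRED_BIAS = {
    0: [0.0, 0.0, 0.0, 0.0],
    1: [0.0, 0.0, 0.5, 0.0],  # after "a", favour "b"
    2: [0.0, 0.0, 0.0, 0.0],
    3: [0.0, 0.0, 0.0, 0.0],
}

def maybe_mask(pred_vec, mask_prob, rng):
    """Training-time masking: zero the predictor output with probability mask_prob."""
    if rng.random() < mask_prob:
        return [0.0] * len(pred_vec)
    return pred_vec

def joint(enc_frame, pred_vec):
    """Toy joiner: element-wise sum of encoder and predictor scores."""
    return [e + p for e, p in zip(enc_frame, pred_vec)]

def argmax(scores, allowed=None):
    """Index of the highest score, optionally restricted to the ids in `allowed`."""
    ids = range(len(scores)) if allowed is None else allowed
    return max(ids, key=lambda i: scores[i])

def decode_nar(enc):
    """NAR: predictor masked out, so every frame is scored independently."""
    zeros = [0.0] * len(enc[0])
    picks = [(t, argmax(joint(frame, zeros))) for t, frame in enumerate(enc)]
    return [(t, tok) for t, tok in picks if tok != BLANK]

def decode_ar(enc):
    """AR: frames are visited in order; the predictor sees the last emitted token."""
    prev, out = BLANK, []  # assumption: blank id doubles as start-of-sequence
    for frame in enc:
        tok = argmax(joint(frame, PRED_BIAS[prev]))
        if tok != BLANK:
            out.append(tok)
            prev = tok
    return out

def decode_sar(enc, steps=1):
    """SAR: NAR draft, then re-predict each token given the draft token before it."""
    draft = decode_nar(enc)
    tokens = [tok for _, tok in draft]
    non_blank = range(1, len(enc[0]))
    for _ in range(steps):
        history = [BLANK] + tokens[:-1]
        # Each position depends only on the previous pass, so this loop is parallelizable.
        tokens = [
            argmax(joint(enc[t], PRED_BIAS[prev]), non_blank)
            for (t, _), prev in zip(draft, history)
        ]
    return tokens

enc = [
    [0.1, 2.0, 0.0, 0.0],  # frame 0: clearly "a"
    [3.0, 0.0, 0.0, 0.0],  # frame 1: blank
    [0.0, 0.0, 1.0, 1.2],  # frame 2: ambiguous between "b" and "c"
]
print(decode_nar(enc))  # [(0, 1), (2, 3)]
print(decode_ar(enc))   # [1, 2]
print(decode_sar(enc))  # [1, 2]
```

In this toy input the NAR draft picks "c" for the ambiguous last frame because it has no label context. A single refinement pass conditions that position on the preceding draft token "a" and recovers the same output as sequential AR decoding, while every refined position is computed independently of the others within the pass.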