AI Summary
This work addresses the lack of a streaming inference mechanism in existing speech large language models, which makes it difficult to achieve real-time decoding and accurate end-of-utterance detection at the same time. The authors propose a tightly coupled dual-vocabulary architecture that combines frame-synchronous acoustic alignment with language model reasoning by adding a transducer branch that shares the audio encoder with the LLM. The method uses chunk-synchronized streaming training with gradient stopping, localized decoder audio attention implemented as a causal sliding window, and zero-cost score fusion to enable efficient streaming recognition and long-form audio support. On the Open ASR Leaderboard, the approach achieves an average word error rate (WER) of 6.71% offline and 8.40% under streaming with 960 ms chunks. It also attains 3.64% WER on TED-LIUM and 10.88% WER on Earnings-22. When its sentence-end punctuation timestamps are combined with acoustic voice activity detection (VAD), the end-of-utterance detection F1 score improves by 0.03 over acoustic VAD alone.
Abstract
Speech Large Language Models (Speech LLMs) lack a principled mechanism for streaming inference: their label-synchronous generation has no acoustic-frame alignment, making real-time decoding and end-of-utterance detection difficult. We propose TRADE (TRansducer-Augmented DEcoder), which augments a multimodal LLM with a transducer branch that shares the audio encoder and uses the LLM's hidden states directly as the prediction network, coupling frame-synchronous acoustic alignment with the LLM's linguistic reasoning. Three design choices make the system accurate, streamable, and long-form capable: (1) tightly coupled dual vocabularies, with a compact transducer vocabulary derived from the LLM vocabulary, enabling zero-cost score fusion; (2) chunk-synchronized streaming training with gradient stopping, eliminating the train-inference mismatch at offline-equivalent memory cost; and (3) Localized Decoder Audio Attention (LDAA), a causal sliding window that caps KV-cache memory independently of utterance length. A single TRADE checkpoint supports offline and streaming decoding across a continuous range of latency operating points. TRADE achieves 6.71% average WER on the Open ASR Leaderboard, while streaming recognition with a 960 ms chunk size reaches 8.40% from the same checkpoint. On long-form speech, it obtains 3.64% WER on TED-LIUM and 10.88% on Earnings-22 without external segmentation. TRADE provides sentence-end punctuation timestamps that, when combined with acoustic voice activity detection (VAD), improve end-of-utterance detection by +0.03 F1 over acoustic VAD alone.
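To illustrate why a causal sliding window keeps memory bounded regardless of utterance length, the following is a minimal sketch of a fixed-size frame cache fed chunk by chunk. It is not the TRADE implementation: the class name `SlidingWindowCache`, the window of 8 frames, and the chunk size of 4 frames are all hypothetical values chosen for illustration, and integers stand in for the audio key/value entries.

```python
from collections import deque


class SlidingWindowCache:
    """Hypothetical helper: keeps only the most recent `window` audio frames."""

    def __init__(self, window: int):
        if window <= 0:
            raise ValueError("window must be positive")
        # A deque with maxlen drops the oldest entry once it is full.
        self.frames = deque(maxlen=window)
        self.total_seen = 0

    def append_chunk(self, chunk):
        """Add one streaming chunk of frames, evicting the oldest as needed."""
        for frame in chunk:
            self.frames.append(frame)
            self.total_seen += 1

    def visible_range(self):
        """Return the half-open range [start, end) of frame indices still cached."""
        start = self.total_seen - len(self.frames)
        return start, self.total_seen


# Assumed sizes for illustration only: 100 frames arriving in chunks of 4.
cache = SlidingWindowCache(window=8)
for chunk_start in range(0, 100, 4):
    cache.append_chunk(range(chunk_start, chunk_start + 4))

print(len(cache.frames), cache.visible_range())
```

However many chunks arrive, the cache never holds more than `window` entries, which is the property the abstract attributes to LDAA. Because only past frames are ever stored, the window is also causal by construction.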