🤖 AI Summary
To address the challenge of deploying large pre-trained speech models (e.g., Whisper) in low-latency streaming ASR, this paper proposes a prefix-to-prefix fine-tuning framework that enables quasi-monotonic speech-text alignment. Methodologically, it introduces: (1) a Continuous Integrate-and-Fire alignment mechanism; (2) Monotonic Finite Look-ahead Attention, which enables tunable latency–accuracy trade-offs; and (3) end-to-end streaming fine-tuning via wait-k decoding. Evaluated across multiple datasets, the approach achieves millisecond-level, controllable latency while approaching offline Whisper accuracy. Theoretically, the paper proves alignment monotonicity and training stability, establishing the first streaming fine-tuning paradigm for Whisper with strict, configurable latency guarantees.
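The paper's implementation is not reproduced here, but the Continuous Integrate-and-Fire idea named above can be illustrated with a minimal sketch: per-frame weights are accumulated until they cross a firing threshold (assumed 1.0, the standard CIF convention), at which point one integrated token embedding is emitted and the leftover weight carries over to the next token. The function name and the source of the weights are illustrative assumptions, not the paper's API.

```python
import numpy as np

def cif_integrate(frames, alphas, threshold=1.0):
    """Illustrative CIF sketch: accumulate per-frame weights `alphas`
    over encoder `frames` and fire one integrated embedding each time
    the running sum reaches `threshold` (assumed 1.0)."""
    fired = []
    acc = 0.0                            # accumulated weight so far
    state = np.zeros(frames.shape[1])    # weighted sum of frames so far
    for h, a in zip(frames, alphas):
        if acc + a < threshold:          # not enough weight yet: keep integrating
            acc += a
            state = state + a * h
        else:
            r = threshold - acc          # portion of this frame that completes the token
            fired.append(state + r * h)  # fire: emit the integrated embedding
            acc = a - r                  # remainder starts the next token
            state = acc * h
    return np.stack(fired) if fired else np.zeros((0, frames.shape[1]))
```

With four identical frames and weights of 0.5 each, the sketch fires exactly twice, once per accumulated unit of weight, which is the quasi-monotonic speech-to-token segmentation the summary refers to.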
📝 Abstract
Applying large pre-trained speech models such as Whisper has shown promise in reducing training costs for various speech tasks. However, integrating these models into streaming systems remains a challenge. This paper presents a novel prefix-to-prefix training framework for streaming recognition that fine-tunes Whisper. We introduce the Continuous Integrate-and-Fire mechanism to establish a quasi-monotonic alignment between continuous speech sequences and discrete text tokens. Additionally, we design Monotonic Finite Look-ahead Attention, which allows each token to attend to unbounded left-context and finite right-context in the speech sequences. We also employ the wait-k decoding strategy to simplify the decoding process while ensuring consistency between training and testing. Our theoretical analysis and experiments demonstrate that this approach achieves a controllable trade-off between latency and quality, making it suitable for a range of streaming applications.
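The attention pattern described in the abstract, unbounded left-context plus a finite look-ahead governed by a wait-k schedule, can be sketched as a boolean cross-attention mask. The mapping of one token to a fixed number of speech frames is a simplifying assumption for illustration; in practice the boundary would come from the CIF alignment rather than a constant `frames_per_token`.

```python
import numpy as np

def wait_k_mask(num_tokens, num_frames, k, frames_per_token=1):
    """Illustrative wait-k style mask: token i may attend to speech
    frames 0 .. (i + k) * frames_per_token - 1, i.e. all left context
    plus a finite right look-ahead of k token-steps (sketch only)."""
    mask = np.zeros((num_tokens, num_frames), dtype=bool)
    for i in range(num_tokens):
        cutoff = min((i + k) * frames_per_token, num_frames)
        mask[i, :cutoff] = True          # visible frames for token i
    return mask
```

Increasing `k` widens every token's visible window, which is the knob behind the latency-quality trade-off: larger `k` means more right-context (better accuracy) at the cost of waiting longer before emitting each token.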