🤖 AI Summary
To address the challenge of deploying large pre-trained speech models (e.g., Whisper) in low-latency streaming ASR, this paper proposes a prefix-to-prefix fine-tuning framework that enables quasi-monotonic speech-text alignment. Methodologically, it introduces: (1) a Continuous Integrate-and-Fire alignment mechanism; (2) Monotonic Finite Look-ahead Attention, which enables tunable latency–accuracy trade-offs; and (3) end-to-end streaming fine-tuning via wait-k decoding. Evaluated across multiple datasets, the approach achieves millisecond-level, controllable latency while approaching offline Whisper accuracy. Theoretically, the paper proves alignment monotonicity and training stability, establishing the first streaming fine-tuning paradigm for Whisper with strict, configurable latency guarantees.
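The paper's implementation is not reproduced here, but the Continuous Integrate-and-Fire idea named above can be illustrated with a minimal sketch: per-frame weights are accumulated until they cross a firing threshold (assumed 1.0, the standard CIF convention), at which point one integrated token embedding is emitted and the leftover weight carries over to the next token. The function name and the source of the weights are illustrative assumptions, not the paper's API.

```python
import numpy as np

def cif_integrate(frames, alphas, threshold=1.0):
    """Illustrative CIF sketch: accumulate per-frame weights `alphas`
    over encoder `frames` and fire one integrated embedding each time
    the running sum reaches `threshold` (assumed 1.0)."""
    fired = []
    acc = 0.0                            # accumulated weight so far
    state = np.zeros(frames.shape[1])    # weighted sum of frames so far
    for h, a in zip(frames, alphas):
        if acc + a < threshold:          # not enough weight yet: keep integrating
            acc += a
            state = state + a * h
        else:
            r = threshold - acc          # portion of this frame that completes the token
            fired.append(state + r * h)  # fire: emit the integrated embedding
            acc = a - r                  # remainder starts the next token
            state = acc * h
    return np.stack(fired) if fired else np.zeros((0, frames.shape[1]))
```

With four identical frames and weights of 0.5 each, the sketch fires exactly twice, once per accumulated unit of weight, which is the quasi-monotonic speech-to-token segmentation the summary refers to.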
📝 Abstract
Applying large pre-trained speech models such as Whisper has shown promise in reducing training costs for various speech tasks. However, integrating these models into streaming systems remains a challenge. This paper presents a novel prefix-to-prefix training framework for streaming recognition that fine-tunes Whisper. We introduce the Continuous Integrate-and-Fire mechanism to establish a quasi-monotonic alignment between continuous speech sequences and discrete text tokens. Additionally, we design Monotonic Finite Look-ahead Attention, which allows each token to attend to unbounded left-context and finite right-context in the speech sequences. We also employ the wait-k decoding strategy to simplify the decoding process while ensuring consistency between training and testing. Our theoretical analysis and experiments demonstrate that this approach achieves a controllable trade-off between latency and quality, making it suitable for a range of streaming applications.
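The attention pattern described in the abstract, unbounded left-context plus a finite look-ahead governed by a wait-k schedule, can be sketched as a boolean cross-attention mask. The mapping of one token to a fixed number of speech frames is a simplifying assumption for illustration; in practice the boundary would come from the CIF alignment rather than a constant `frames_per_token`.

```python
import numpy as np

def wait_k_mask(num_tokens, num_frames, k, frames_per_token=1):
    """Illustrative wait-k style mask: token i may attend to speech
    frames 0 .. (i + k) * frames_per_token - 1, i.e. all left context
    plus a finite right look-ahead of k token-steps (sketch only)."""
    mask = np.zeros((num_tokens, num_frames), dtype=bool)
    for i in range(num_tokens):
        cutoff = min((i + k) * frames_per_token, num_frames)
        mask[i, :cutoff] = True          # visible frames for token i
    return mask
```

Increasing `k` widens every token's visible window, which is the knob behind the latency-quality trade-off: larger `k` means more right-context (better accuracy) at the cost of waiting longer before emitting each token.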