MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of deploying large pre-trained speech models (e.g., Whisper) in low-latency streaming ASR, this paper proposes a prefix-to-prefix fine-tuning framework that enables quasi-monotonic speech-text alignment. Methodologically, it introduces: (1) a Continuous Integrate-and-Fire alignment mechanism; (2) Monotonic Finite Look-ahead Attention, which enables a tunable latency-accuracy trade-off; and (3) end-to-end streaming fine-tuning via wait-k decoding. Evaluated across multiple datasets, the approach achieves millisecond-level controllable latency while approaching offline Whisper accuracy. The authors also prove alignment monotonicity and training stability, establishing the first streaming fine-tuning paradigm for Whisper with strict, configurable latency guarantees.

📝 Abstract
Applying large pre-trained speech models like Whisper has shown promise in reducing training costs for various speech tasks. However, integrating these models into streaming systems remains a challenge. This paper presents a novel prefix-to-prefix training framework for streaming recognition by fine-tuning Whisper. We introduce the Continuous Integrate-and-Fire mechanism to establish a quasi-monotonic alignment between continuous speech sequences and discrete text tokens. Additionally, we design Monotonic Finite Look-ahead Attention, allowing each token to attend to unbounded left-context and finite right-context of the speech sequence. We also employ the wait-k decoding strategy to simplify the decoding process while ensuring consistency between training and testing. Our theoretical analysis and experiments demonstrate that this approach achieves a controllable trade-off between latency and quality, making it suitable for various streaming applications.
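As a rough illustration of the finite look-ahead constraint described in the abstract (a hedged sketch, not the authors' implementation): if the Continuous Integrate-and-Fire module supplies a firing frame index for each text token, the attention mask can grant every token unbounded left-context plus a fixed window of future frames. The function name, the `align_pos` input, and the `lookahead` parameter are assumptions made for this sketch.

```python
import numpy as np

def mfla_mask(align_pos, num_frames, lookahead):
    """Quasi-monotonic attention mask sketch: text token i may attend to
    all speech frames up to its assumed CIF firing position align_pos[i]
    plus `lookahead` future frames (finite right-context)."""
    num_tokens = len(align_pos)
    mask = np.zeros((num_tokens, num_frames), dtype=bool)
    for i, p in enumerate(align_pos):
        # unlimited left context; right context capped at `lookahead` frames
        mask[i, : min(p + lookahead + 1, num_frames)] = True
    return mask
```

For example, with firing positions `[2, 5]`, 10 encoder frames, and a 2-frame look-ahead, token 0 sees frames 0-4 and token 1 sees frames 0-7; widening `lookahead` trades latency for more acoustic context.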
Problem

Research questions and friction points this paper is trying to address.

Integrating large pre-trained models into streaming speech recognition systems
Establishing quasi-monotonic alignment between speech and text tokens
Achieving controllable latency-quality trade-off in streaming applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prefix-to-prefix training framework for streaming
Continuous Integrate-and-Fire for quasi-monotonic alignment
Monotonic Finite Look-ahead Attention mechanism
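The wait-k decoding strategy listed among the contributions can be sketched as a simple read/write schedule (an illustrative assumption about chunking, not the paper's code): the decoder first reads k speech chunks, then alternates emitting one token per additional chunk read.

```python
def wait_k_schedule(k, num_tokens, step=1):
    """Wait-k schedule sketch: returns, for each target token index, how
    many speech chunks must have been read before that token is emitted.
    Assumes one chunk is read per decoding step after the initial wait."""
    return [k + i * step for i in range(num_tokens)]
```

With k = 3 and four target tokens, the schedule is `[3, 4, 5, 6]`: larger k lowers error at the cost of latency, which is the controllable trade-off the paper evaluates.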
Yinfeng Xia
Honor Device Co., Ltd., China
Huiyan Li
Honor Device Co., Ltd., China
Chenyang Le
Shanghai Jiao Tong University
Manhong Wang
Honor Device Co., Ltd., China
Yutao Sun
Tsinghua University
Natural Language Processing, Machine Learning
Xingyang Ma
Honor Device Co., Ltd., China
Yanmin Qian
Professor, Shanghai Jiao Tong University
Speech and Language Processing, Signal Processing, Machine Learning