Whisper Has an Internal Word Aligner

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses Whisper’s lack of precise word-level timestamps by proposing an unsupervised, training-free word alignment method. The core insight is the discovery that specific attention heads in Whisper inherently exhibit word-alignment capability; leveraging this property, the method integrates character-level forced decoding with an attention-weight-based unsupervised filtering mechanism to achieve fine-grained word boundary localization. Crucially, it eliminates reliance on external ASR systems, phoneme modeling, or supervised annotations—common requirements in conventional alignment approaches. Under strict temporal tolerances of 20–100 ms, the method achieves significantly higher word-level alignment accuracy than existing training-free alternatives, setting a new state-of-the-art (SOTA). Moreover, it preserves Whisper’s native architecture and inference efficiency, enabling seamless integration without architectural modification or computational overhead.
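The head-filtering idea in the summary can be illustrated with a small sketch. The paper's exact filtering criterion is not given here, so the scoring heuristic below (sharp attention peaks that advance monotonically over time) is an assumption; `head_alignment_score` and `filter_heads` are hypothetical names, and the cross-attention matrices are taken as already extracted from the decoder:

```python
import numpy as np

def head_alignment_score(attn):
    """Score one cross-attention head (tokens x frames) for alignment quality.

    Heuristic sketch: alignment-capable heads tend to place sharp attention
    peaks that move monotonically forward in time. (Illustrative only; the
    paper's actual unsupervised filter may use a different criterion.)
    """
    peaks = attn.argmax(axis=1)               # peak frame index per token
    monotonic = np.mean(np.diff(peaks) >= 0)  # fraction of non-decreasing steps
    sharpness = attn.max(axis=1).mean()       # average mass at the peak
    return monotonic * sharpness

def filter_heads(attns, top_k=3):
    """Keep the top_k heads by alignment score; attns: list of (T x F) arrays."""
    scores = [head_alignment_score(a) for a in attns]
    order = np.argsort(scores)[::-1]
    return [int(i) for i in order[:top_k]]
```

Because the filter only ranks heads by a statistic of their attention weights, it needs no labels or training, matching the summary's "unsupervised, training-free" claim.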

📝 Abstract
There is an increasing interest in obtaining accurate word-level timestamps from strong automatic speech recognizers, in particular Whisper. Existing approaches either require additional training or are simply not competitive. The evaluation in prior work is also relatively loose, typically using a tolerance of more than 200 ms. In this work, we discover attention heads in Whisper that capture accurate word alignments and are distinctively different from those that do not. Moreover, we find that using characters produces finer and more accurate alignments than using wordpieces. Based on these findings, we propose an unsupervised approach to extracting word alignments by filtering attention heads while teacher forcing Whisper with characters. Our approach not only does not require training but also produces word alignments that are more accurate than prior work under a stricter tolerance between 20 ms and 100 ms.
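To make the character-level step concrete: once attention from the filtered heads is averaged, each character token from teacher forcing can be timestamped at its attention peak, and word boundaries read off from the first and last character of each word. The sketch below assumes Whisper's 20 ms encoder frame rate; the boundary-extraction details are a simplification, not the paper's exact procedure:

```python
import numpy as np

FRAME_SEC = 0.02  # Whisper encoder emits one frame per 20 ms of audio

def word_boundaries(attn, chars):
    """Map character-level attention peaks to word-level timestamps.

    attn: (num_chars x num_frames) cross-attention averaged over filtered
    heads, obtained by teacher forcing the decoder with characters.
    chars: the transcript as characters, with ' ' separating words.
    Returns a list of (word, (start_sec, end_sec)) pairs. Illustrative sketch.
    """
    times = attn.argmax(axis=1) * FRAME_SEC  # peak time per character
    out, word, start = [], "", None
    for i, c in enumerate(chars):
        if c == " ":
            if word:
                out.append((word, (start, float(times[i - 1]))))
            word, start = "", None
        else:
            if start is None:
                start = float(times[i])
            word += c
    if word:
        out.append((word, (start, float(times[len(chars) - 1]))))
    return out
```

Characters give one attention row per ~one phoneme-scale unit, which is why the abstract reports finer boundaries than wordpiece tokens, whose rows each cover several characters of audio.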
Problem

Research questions and friction points this paper is trying to address.

Extracting accurate word-level timestamps from Whisper
Identifying internal attention heads for alignment
Producing unsupervised character-based alignments without training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised head filtering for alignments
Character-based teacher forcing method
Attention heads capture word alignments
Sung-Lin Yeh
CS PhD Student, University of Edinburgh
Speech and Language Processing · Speech Recognition
Yen Meng
Centre for Speech Technology Research, University of Edinburgh, UK
Hao Tang
Centre for Speech Technology Research, University of Edinburgh, UK