🤖 AI Summary
This work addresses Whisper's lack of precise word-level timestamps with an unsupervised, training-free word alignment method. The core insight is that specific attention heads in Whisper inherently capture word alignments; leveraging this property, the method combines character-level teacher forcing with unsupervised, attention-weight-based head filtering to localize word boundaries at fine granularity. Crucially, it requires no external ASR system, phoneme modeling, or supervised annotation, all common requirements of conventional alignment approaches. Under strict temporal tolerances of 20–100 ms, the method achieves significantly higher word-level alignment accuracy than existing training-free alternatives, setting a new state of the art (SOTA). Moreover, it preserves Whisper's native architecture and inference efficiency, enabling seamless integration without architectural modification or computational overhead.
📝 Abstract
There is an increasing interest in obtaining accurate word-level timestamps from strong automatic speech recognizers, in particular Whisper. Existing approaches either require additional training or are simply not competitive. The evaluation in prior work is also relatively loose, typically using a tolerance of more than 200 ms. In this work, we discover attention heads in Whisper that capture accurate word alignments and are distinctively different from those that do not. Moreover, we find that using characters produces finer and more accurate alignments than using wordpieces. Based on these findings, we propose an unsupervised approach to extracting word alignments by filtering attention heads while teacher forcing Whisper with characters. Our approach not only does not require training but also produces word alignments that are more accurate than prior work under a stricter tolerance between 20 ms and 100 ms.
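The pipeline the abstract describes (teacher-force Whisper with characters, inspect cross-attention weights, filter alignment-capable heads, read off word boundaries) can be sketched roughly as below. This is a minimal, hypothetical illustration on synthetic attention matrices, not the paper's implementation: the monotonicity-based head filter and the `char_to_word` mapping are assumptions for demonstration, and real cross-attention weights would have to be hooked out of Whisper's decoder.

```python
import numpy as np

# Hypothetical sketch: extract word boundaries from decoder cross-attention
# weights obtained by teacher forcing with characters.
# attn_all has shape (n_heads, n_chars, n_frames); in practice these weights
# would be captured from Whisper's decoder cross-attention layers.

def alignment_path(attn):
    """Frame index receiving maximal attention for each character."""
    return attn.argmax(axis=-1)

def monotonicity_score(path):
    """Fraction of consecutive character steps that move forward in time.
    Heads that track word alignment tend to be near-monotonic; this is an
    assumed proxy for the paper's unsupervised head-filtering criterion."""
    diffs = np.diff(path)
    return float((diffs >= 0).mean())

def filter_heads(attn_all, threshold=0.9):
    """Keep only heads whose argmax path is (near-)monotonic."""
    keep = [h for h in range(attn_all.shape[0])
            if monotonicity_score(alignment_path(attn_all[h])) >= threshold]
    return attn_all[keep]

def word_boundaries(attn_all, char_to_word, frame_dur=0.02, threshold=0.9):
    """Average the filtered heads, align each character to a frame,
    then group character times into per-word (start_sec, end_sec) spans."""
    attn = filter_heads(attn_all, threshold).mean(axis=0)
    path = alignment_path(attn)          # one frame per character
    spans = {}
    for c, w in enumerate(char_to_word):
        t = path[c] * frame_dur
        s, e = spans.get(w, (t, t))
        spans[w] = (min(s, t), max(e, t))
    return [spans[w] for w in sorted(spans)]

# Tiny synthetic demo: one monotonic ("aligning") head, one scattered head.
good = np.zeros((6, 10))
for i, f in enumerate([0, 1, 2, 4, 5, 6]):
    good[i, f] = 1.0
bad = np.zeros((6, 10))
for i, f in enumerate([5, 1, 8, 0, 3, 2]):
    bad[i, f] = 1.0
attn_all = np.stack([good, bad])
print(word_boundaries(attn_all, char_to_word=[0, 0, 0, 1, 1, 1]))
```

In this toy run the scattered head is filtered out, so the word spans are read from the monotonic head alone; the paper's actual filtering statistic and decoding details may differ.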