🤖 AI Summary
This work addresses Whisper's lack of precise word-level timestamps with an unsupervised, training-free word alignment method. The core insight is that specific attention heads in Whisper inherently capture word alignments; leveraging this property, the method combines character-level teacher forcing with unsupervised, attention-weight-based head filtering to localize word boundaries at fine granularity. Crucially, it requires no external ASR system, phoneme modeling, or supervised annotation, all common requirements of conventional alignment approaches. Under strict temporal tolerances of 20–100 ms, the method achieves significantly higher word-level alignment accuracy than existing training-free alternatives, setting a new state of the art (SOTA). Moreover, it preserves Whisper's native architecture and inference efficiency, enabling seamless integration without architectural modification or computational overhead.
📝 Abstract
There is an increasing interest in obtaining accurate word-level timestamps from strong automatic speech recognizers, in particular Whisper. Existing approaches either require additional training or are simply not competitive. The evaluation in prior work is also relatively loose, typically using a tolerance of more than 200 ms. In this work, we discover attention heads in Whisper that capture accurate word alignments and are distinctively different from those that do not. Moreover, we find that using characters produces finer and more accurate alignments than using wordpieces. Based on these findings, we propose an unsupervised approach to extracting word alignments by filtering attention heads while teacher forcing Whisper with characters. Our approach not only does not require training but also produces word alignments that are more accurate than prior work under a stricter tolerance between 20 ms and 100 ms.
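The pipeline the abstract describes (teacher-force Whisper with characters, inspect cross-attention weights, filter alignment-capable heads, read off word boundaries) can be sketched roughly as below. This is a minimal, hypothetical illustration on synthetic attention matrices, not the paper's implementation: the monotonicity-based head filter and the `char_to_word` mapping are assumptions for demonstration, and real cross-attention weights would have to be hooked out of Whisper's decoder.

```python
import numpy as np

# Hypothetical sketch: extract word boundaries from decoder cross-attention
# weights obtained by teacher forcing with characters.
# attn_all has shape (n_heads, n_chars, n_frames); in practice these weights
# would be captured from Whisper's decoder cross-attention layers.

def alignment_path(attn):
    """Frame index receiving maximal attention for each character."""
    return attn.argmax(axis=-1)

def monotonicity_score(path):
    """Fraction of consecutive character steps that move forward in time.
    Heads that track word alignment tend to be near-monotonic; this is an
    assumed proxy for the paper's unsupervised head-filtering criterion."""
    diffs = np.diff(path)
    return float((diffs >= 0).mean())

def filter_heads(attn_all, threshold=0.9):
    """Keep only heads whose argmax path is (near-)monotonic."""
    keep = [h for h in range(attn_all.shape[0])
            if monotonicity_score(alignment_path(attn_all[h])) >= threshold]
    return attn_all[keep]

def word_boundaries(attn_all, char_to_word, frame_dur=0.02, threshold=0.9):
    """Average the filtered heads, align each character to a frame,
    then group character times into per-word (start_sec, end_sec) spans."""
    attn = filter_heads(attn_all, threshold).mean(axis=0)
    path = alignment_path(attn)          # one frame per character
    spans = {}
    for c, w in enumerate(char_to_word):
        t = path[c] * frame_dur
        s, e = spans.get(w, (t, t))
        spans[w] = (min(s, t), max(e, t))
    return [spans[w] for w in sorted(spans)]

# Tiny synthetic demo: one monotonic ("aligning") head, one scattered head.
good = np.zeros((6, 10))
for i, f in enumerate([0, 1, 2, 4, 5, 6]):
    good[i, f] = 1.0
bad = np.zeros((6, 10))
for i, f in enumerate([5, 1, 8, 0, 3, 2]):
    bad[i, f] = 1.0
attn_all = np.stack([good, bad])
print(word_boundaries(attn_all, char_to_word=[0, 0, 0, 1, 1, 1]))
```

In this toy run the scattered head is filtered out, so the word spans are read from the monotonic head alone; the paper's actual filtering statistic and decoding details may differ.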