🤖 AI Summary
To address the limited robustness of automatic speech recognition (ASR) systems against non-native and accented speech, this paper proposes a data-driven mispronunciation modeling approach. First, it leverages attention maps to achieve unsupervised phoneme-level alignment between non-native and native utterances, without requiring prior linguistic knowledge. Based on this alignment, the method automatically discovers phoneme-level mispronunciation patterns and performs end-to-end adaptation of the ASR model. Evaluated on native English speech, the approach improves recognition performance by 5.7%; on Korean-accented English speech, performance improves by 12.8%, demonstrating substantial cross-accent generalization. The core contribution lies in repurposing the attention mechanism as an interpretable tool for analyzing pronunciation deviations, thereby establishing a novel paradigm for low-resource accent adaptation.
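To make the alignment idea concrete, here is a minimal toy sketch of how a cross-attention map between a non-native and a native phone sequence could yield phone-to-phone alignments and surface candidate mispronunciation patterns. The paper's actual alignment procedure is not detailed in this summary; the attention values, phone labels, and the simple per-row argmax strategy below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def align_phones(attention, native_phones, nonnative_phones):
    """Align each non-native phone to the native phone receiving the
    highest attention weight (illustrative; monotonicity not enforced).

    attention: array of shape (len(nonnative_phones), len(native_phones))
    """
    best = attention.argmax(axis=1)
    return [(nn, native_phones[i]) for nn, i in zip(nonnative_phones, best)]

# Toy example: "rice" /r ay s/ realized with a Korean-accented /l/ onset.
# Attention rows are hypothetical, not taken from the paper.
att = np.array([
    [0.8, 0.1, 0.1],  # non-native /l/ attends mostly to native /r/
    [0.1, 0.8, 0.1],  # /ay/ -> /ay/
    [0.1, 0.1, 0.8],  # /s/  -> /s/
])
pairs = align_phones(att, ["r", "ay", "s"], ["l", "ay", "s"])
# Mismatched pairs such as ("l", "r") become candidate
# mispronunciation patterns for adapting the ASR model.
mismatches = [p for p in pairs if p[0] != p[1]]
```

In this sketch, a pair whose two phones differ flags a systematic substitution (here /r/ → /l/), which is the kind of phoneme-level pattern the data-driven approach would feed into model adaptation.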
📝 Abstract
Recent advancements in machine learning have significantly improved speech recognition, but recognizing speech from non-fluent or accented speakers remains a challenge. Previous efforts, relying on rule-based pronunciation patterns, have struggled to fully capture non-native errors. We propose two data-driven approaches that use speech corpora to automatically detect mispronunciation patterns. By aligning non-native phones with their native counterparts using attention maps, we achieved a 5.7% improvement in speech recognition on native English datasets and a 12.8% improvement for non-native English speakers, particularly Korean speakers. Our method offers practical advancements for robust Automatic Speech Recognition (ASR) systems, particularly in situations where prior linguistic knowledge is not applicable.