🤖 AI Summary
Conventional Goodness of Pronunciation (GOP) scores, derived from softmax posterior probabilities in automatic speech recognition (ASR), are not sensitive enough for second-language (L2) pronunciation error detection. Method: This study evaluates GOP scores computed directly from ASR model logits, including a max-logit GOP metric and a hybrid GOP score that combines probability and logit features. Contribution/Results: Max-logit GOP correlates significantly more strongly with human perceptual judgments than probability-based GOP (p < 0.01). The hybrid approach enhances robustness while preserving interpretability. Experiments on L2 English speech from native Dutch and Mandarin speakers show that logit-based GOP outperforms probability-based GOP in binary error detection, although the size of the gain depends on dataset characteristics. The results suggest that hybrid GOP methods with uncertainty modeling and phoneme-specific weighting can make L2 pronunciation assessment more accurate and interpretable.
📝 Abstract
Pronunciation assessment relies on goodness of pronunciation (GOP) scores, traditionally derived from softmax-based posterior probabilities. However, posterior probabilities may suffer from overconfidence and poor phoneme separation, limiting their effectiveness. This study compares logit-based GOP scores with probability-based GOP scores for mispronunciation detection. We conducted experiments on two L2 English speech datasets, recorded from native Dutch and Mandarin speakers, assessing classification performance and correlation with human ratings. Logit-based methods outperform probability-based GOP in classification, but their effectiveness depends on dataset characteristics. The maximum logit GOP shows the strongest alignment with human perception, while a combination of different GOP scores balances probability and logit features. The findings suggest that hybrid GOP methods incorporating uncertainty modeling and phoneme-specific weighting improve pronunciation assessment.
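
To make the contrast between probability-based and logit-based scoring concrete, here is a minimal sketch in Python with NumPy. The exact formulas are assumptions, since the abstract does not give them: the probability-based score is taken as the mean log posterior of the target phoneme over the frames of a segment, and the max-logit score as the largest raw target logit in that segment. The function names and the toy data are hypothetical.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def gop_probability(logits: np.ndarray, target: int) -> float:
    """Assumed probability-based GOP: mean log posterior of the target phoneme."""
    log_posteriors = np.log(softmax(logits)[:, target])
    return float(log_posteriors.mean())


def gop_max_logit(logits: np.ndarray, target: int) -> float:
    """Assumed max-logit GOP: largest raw target-phoneme logit in the segment."""
    return float(logits[:, target].max())


# Hypothetical toy segments: 3 frames x 3 phoneme classes, target class 0.
# Both segments favour the target, but with different raw evidence.
weak_segment = np.array([[10.0, 0.0, 0.0]] * 3)
strong_segment = np.array([[20.0, 0.0, 0.0]] * 3)

for name, segment in [("weak", weak_segment), ("strong", strong_segment)]:
    print(name, gop_probability(segment, 0), gop_max_logit(segment, 0))
```

In this toy case the softmax saturates, so the two probability-based scores are nearly identical, while the max-logit scores still differ. This is one way to picture the overconfidence and poor separation that the abstract attributes to posterior probabilities. It does not reproduce the paper's experiments.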