🤖 AI Summary
This work addresses the overestimation of word error rate (WER) in multi-script automatic speech recognition (ASR), where reference and hypothesis transcripts use different writing systems (for example, romanized versus native script), so that script mismatches are counted as recognition errors. The authors propose SN-WER, a training-free, evaluation-only metric that transliterates both reference and hypothesis text into a language-specific canonical script before computing WER. Evaluated on five Indic languages, two datasets, and three ASR models, SN-WER reduces inflated gaps between models by up to 12% on curated FLEURS data, attenuates artificial romanization-induced WER inflation by 67% in controlled stress tests, and keeps token-collision rates below 0.1%, making ASR evaluation in multi-script settings fairer and more robust.
📝 Abstract
Word Error Rate (WER) is the dominant metric for automatic speech recognition (ASR), but it can overestimate errors when references and hypotheses encode the same words in different scripts. This issue is common in multilingual settings where ASR models may emit romanized text. We propose Script-Normalized WER (SN-WER), a training-free, evaluation-only scoring method that transliterates both reference and hypothesis text into a language-specific canonical script before computing WER. We evaluate SN-WER on 5 Indic languages, 2 datasets, and 3 ASR models. On curated FLEURS data, SN-WER reduces inflated model gaps by up to 12%, while on noisier Common Voice data the reductions are smaller or inconsistent, indicating genuine recognition weaknesses rather than only script mismatch. Controlled stress tests show a 67% attenuation of artificial romanization-induced WER inflation, while lexical-substitution controls show near-identical sensitivity to semantic errors, with ΔSN-WER / ΔWER ≈ 1.09. SN-WER is robust to transliterator choice and normalization changes, and shows token-collision rates below 0.1% in the evaluated Indic setting. We argue that SN-WER should be reported alongside WER and CER as a companion metric for script-insensitive ASR evaluation, especially when transcripts feed downstream search, indexing, or multilingual LLM pipelines.
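The scoring procedure described above (transliterate both sides into one canonical script, then compute ordinary WER) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `wer`, `sn_wer`, `toy_to_devanagari`, and `TOY_HINDI_TABLE` are hypothetical, and the two-entry lookup table stands in for a real transliterator, which the abstract does not specify.

```python
from typing import Callable, List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Standard WER: word edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# ASSUMPTION: a toy romanized-to-Devanagari lookup used only for illustration.
# A real setup would plug in a proper language-specific transliterator here.
TOY_HINDI_TABLE = {"namaste": "नमस्ते", "duniya": "दुनिया"}


def toy_to_devanagari(text: str) -> str:
    """Map known romanized words to Devanagari; leave all other words as-is."""
    return " ".join(TOY_HINDI_TABLE.get(w.lower(), w) for w in text.split())


def sn_wer(reference: str, hypothesis: str,
           to_canonical: Callable[[str], str]) -> float:
    """Script-normalized WER: normalize both sides, then score with plain WER."""
    return wer(to_canonical(reference), to_canonical(hypothesis))


reference = "नमस्ते दुनिया"
print(wer(reference, "namaste duniya"))                        # 1.0
print(sn_wer(reference, "namaste duniya", toy_to_devanagari))  # 0.0
print(sn_wer(reference, "namaste ghar", toy_to_devanagari))    # 0.5
```

In this toy case a correct but romanized hypothesis scores 100% under plain WER and 0% once both sides share a script, while a hypothesis with a genuinely wrong word is still penalized after normalization.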