Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking

📅 2025-10-10
🤖 AI Summary
ASR systems are not robust enough in multi-accent and multi-dialect settings, which leads to elevated WER in linguistically diverse languages such as English and Persian. To address this, we propose an accent-invariant ASR framework with three core components: (1) a saliency-driven spectrogram masking method that suppresses accent-specific cues; (2) a newly collected multi-accent Persian speech dataset, which establishes the first systematic benchmark for accent variation in Persian ASR; and (3) an accent-classifier-guided data augmentation strategy for pre-trained Transformer models. Experiments with the Whisper model show substantial WER reductions in both English and Persian settings. The work contributes a dataset, a benchmark, and a masking-based augmentation method toward robust multilingual ASR, particularly for low-resource languages.

📝 Abstract
Pre-trained transformer-based models have significantly advanced automatic speech recognition (ASR), yet they remain sensitive to accent and dialectal variations, resulting in elevated word error rates (WER) in linguistically diverse languages such as English and Persian. To address this challenge, we propose an accent-invariant ASR framework that integrates accent and dialect classification into the recognition pipeline. Our approach involves training a spectrogram-based classifier to capture accent-specific cues, masking the regions most influential to its predictions, and using the masked spectrograms for data augmentation. This enhances the robustness of ASR models against accent variability. We evaluate the method using both English and Persian speech. For Persian, we introduce a newly collected dataset spanning multiple regional accents, establishing the first systematic benchmark for accent variation in Persian ASR that fills a critical gap in multilingual speech research and provides a foundation for future studies on low-resource, linguistically diverse languages. Experimental results with the Whisper model demonstrate that our masking and augmentation strategy yields substantial WER reductions in both English and Persian settings, confirming the effectiveness of the approach. This research advances the development of multilingual ASR systems that are resilient to accent and dialect diversity. Code and dataset are publicly available at: https://github.com/MH-Sameti/Accent_invariant_ASR

Problem

Research questions and friction points this paper is trying to address.

Reducing accent sensitivity in automatic speech recognition systems
Improving ASR robustness against dialectal variations in multilingual contexts
Addressing word error rate elevation in linguistically diverse languages
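
The metric behind these goals, word error rate, is the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the paper's evaluation script; the function name `wer` and the example sentences are assumptions, and the text normalization that a real evaluation usually applies first is omitted.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (one row of the Levenshtein table).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not from the paper): one substitution and one deletion.
score = wer("the cat sat on the mat", "the cat sit on mat")
```

Here `score` is 2 edits over 6 reference words. Comparing this value per accent group is one simple way to see how far WER rises on accented speech.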

Innovation

Methods, ideas, or system contributions that make the work stand out.

Saliency-driven masking of spectrograms for accent invariance
Accent classification integrated into ASR recognition pipeline
Masked spectrograms used for data augmentation
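
The three steps above can be sketched as follows. The abstract does not say which saliency method or classifier architecture the authors use, so this example assumes gradient-based saliency on a toy linear-softmax accent classifier with random weights. The names `saliency_map` and `mask_top_salient`, the `fraction` parameter, and the mean-value fill are all hypothetical choices for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exps = np.exp(shifted)
    return exps / exps.sum()


def saliency_map(spec, weights, bias):
    """Absolute gradient of the predicted-class probability w.r.t. each bin.

    Assumption: a toy linear-softmax classifier with weights of shape
    (n_accents, n_freq, n_time). A real classifier would obtain this
    gradient from automatic differentiation instead.
    """
    logits = np.tensordot(weights, spec, axes=2) + bias
    probs = softmax(logits)
    c = int(np.argmax(probs))
    # For softmax over linear logits: dp_c/dx = p_c * (w_c - sum_k p_k * w_k)
    grad = probs[c] * (weights[c] - np.tensordot(probs, weights, axes=1))
    return np.abs(grad)


def mask_top_salient(spec, saliency, fraction=0.1):
    """Replace the most salient `fraction` of bins with the spectrogram mean."""
    k = max(1, int(round(fraction * spec.size)))
    top = np.argpartition(saliency.ravel(), -k)[-k:]  # indices of the k largest
    masked = spec.copy()
    masked.flat[top] = spec.mean()
    return masked


rng = np.random.default_rng(0)
spec = rng.normal(size=(8, 10))                # stand-in for a log-mel spectrogram
weights = 0.1 * rng.normal(size=(3, 8, 10))    # stand-in for trained classifier weights
bias = np.zeros(3)

sal = saliency_map(spec, weights, bias)
masked = mask_top_salient(spec, sal, fraction=0.1)
augmented_batch = [spec, masked]               # original plus masked copy for ASR training
```

Masking the bins the accent classifier relies on most, then training the ASR model on both versions, is meant to push the recognizer away from depending on accent-specific regions of the spectrogram.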

Mohammad Hossein Sameti
Department of Computer Engineering, Sharif University of Technology
Sepehr Harfi Moridani
Department of Computer Engineering, Sharif University of Technology
Ali Zarean
Department of Computer Engineering, University of Tehran
Hossein Sameti
Associate Professor, Sharif University of Technology
Speech Recognition and Synthesis, Spoken Dialogue Systems