🤖 AI Summary
This work addresses the significant performance degradation of automatic speech recognition systems on non-mainstream accents, a challenge exacerbated by the difficulty of balancing accent-specific adaptation with generalization. The authors propose MoE-CTC, a novel architecture that integrates a mixture-of-experts model with intermediate CTC supervision. During training, an accent-aware routing mechanism encourages expert specialization, while each expert is equipped with a dedicated CTC head to ensure consistent transcription quality. At inference, the system switches to label-free routing to improve adaptability to unseen accents. The approach combines dynamic routing with intermediate CTC supervision and introduces a routing-augmented loss function. Evaluated on the MCV-Accent benchmark, MoE-CTC consistently outperforms the FastConformer baseline in both high- and low-resource settings, achieving up to a 29.3% relative reduction in word error rate.
📝 Abstract
Accented speech remains a persistent challenge for automatic speech recognition (ASR), as most models are trained on data dominated by a few high-resource English varieties, leading to substantial performance degradation for other accents. Accent-agnostic approaches improve robustness yet struggle with heavily accented or unseen varieties, while accent-specific methods rely on limited and often noisy labels. We introduce MoE-CTC, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. During training, accent-aware routing encourages experts to capture accent-specific patterns; routing then gradually transitions to a label-free mechanism used at inference. Each expert is equipped with its own CTC head to align routing with transcription quality, and a routing-augmented loss further stabilizes optimization. Experiments on the MCV-Accent benchmark demonstrate consistent gains across both seen and unseen accents in low- and high-resource conditions, achieving up to a 29.3% relative WER reduction over strong FastConformer baselines.
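The routing scheme described above can be sketched in a few lines. This is a minimal, dependency-free illustration of the two ideas the abstract names: accent-aware routing (training-time gate biased by an accent label, label-free at inference) and a routing-augmented objective that mixes per-expert CTC losses by routing weight. All function names, the additive label bias, and the auxiliary-loss form are hypothetical stand-ins; the paper's actual formulation may differ.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_logits, accent_id=None, label_bias=2.0):
    """Return routing weights over experts.

    During training (accent_id given), the gate logit of the expert
    associated with the utterance's accent label is boosted by a
    hypothetical additive bias, encouraging specialization. At
    inference (accent_id=None), routing is label-free and depends
    only on the learned gate logits.
    """
    logits = list(gate_logits)
    if accent_id is not None:
        logits[accent_id] += label_bias
    return softmax(logits)

def routing_augmented_loss(per_expert_ctc_losses, weights,
                           aux_loss=0.0, aux_weight=0.1):
    """Sketch of a routing-augmented objective: per-expert CTC losses
    (each expert has its own CTC head) mixed by routing weight, plus
    an auxiliary routing term (e.g. load balancing). The exact terms
    and weighting in the paper are not specified here."""
    main = sum(w * l for w, l in zip(weights, per_expert_ctc_losses))
    return main + aux_weight * aux_loss
```

With uniform gate logits, training-time routing favors the labeled accent's expert, while inference-time routing falls back to the gate alone; weighting each expert's CTC loss by its routing probability is what ties routing decisions to transcription quality.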