🤖 AI Summary
This work addresses the significant performance degradation of automatic speech recognition systems on non-mainstream accents, a challenge exacerbated by the difficulty of balancing accent-specific adaptation with generalization. The authors propose MoE-CTC, a novel architecture that integrates a mixture-of-experts model with intermediate CTC supervision. During training, an accent-aware routing mechanism encourages expert specialization, while each expert is equipped with a dedicated CTC head to ensure consistent transcription quality. At inference, the system switches to label-free routing to improve adaptability to unseen accents. The approach combines dynamic routing with intermediate CTC supervision and introduces a routing-augmented loss function. Evaluated on the MCV-Accent benchmark, MoE-CTC consistently outperforms the FastConformer baseline in both high- and low-resource settings, achieving up to a 29.3% relative reduction in word error rate.
📝 Abstract
Accented speech remains a persistent challenge for automatic speech recognition (ASR), as most models are trained on data dominated by a few high-resource English varieties, leading to substantial performance degradation for other accents. Accent-agnostic approaches improve robustness yet struggle with heavily accented or unseen varieties, while accent-specific methods rely on limited and often noisy labels. We introduce MoE-CTC, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. During training, accent-aware routing encourages experts to capture accent-specific patterns; routing then gradually transitions to a label-free mechanism used at inference. Each expert is equipped with its own CTC head to align routing with transcription quality, and a routing-augmented loss further stabilizes optimization. Experiments on the MCV-Accent benchmark demonstrate consistent gains across both seen and unseen accents in low- and high-resource conditions, achieving up to a 29.3% relative WER reduction over strong FastConformer baselines.
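The routing scheme described above can be sketched in a few lines. This is a minimal, dependency-free illustration of the two ideas the abstract names: accent-aware routing (training-time gate biased by an accent label, label-free at inference) and a routing-augmented objective that mixes per-expert CTC losses by routing weight. All function names, the additive label bias, and the auxiliary-loss form are hypothetical stand-ins; the paper's actual formulation may differ.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_logits, accent_id=None, label_bias=2.0):
    """Return routing weights over experts.

    During training (accent_id given), the gate logit of the expert
    associated with the utterance's accent label is boosted by a
    hypothetical additive bias, encouraging specialization. At
    inference (accent_id=None), routing is label-free and depends
    only on the learned gate logits.
    """
    logits = list(gate_logits)
    if accent_id is not None:
        logits[accent_id] += label_bias
    return softmax(logits)

def routing_augmented_loss(per_expert_ctc_losses, weights,
                           aux_loss=0.0, aux_weight=0.1):
    """Sketch of a routing-augmented objective: per-expert CTC losses
    (each expert has its own CTC head) mixed by routing weight, plus
    an auxiliary routing term (e.g. load balancing). The exact terms
    and weighting in the paper are not specified here."""
    main = sum(w * l for w, l in zip(weights, per_expert_ctc_losses))
    return main + aux_weight * aux_loss
```

With uniform gate logits, training-time routing favors the labeled accent's expert, while inference-time routing falls back to the gate alone; weighting each expert's CTC loss by its routing probability is what ties routing decisions to transcription quality.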