Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition

📅 2026-02-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the significant performance degradation of automatic speech recognition systems on non-mainstream accents, a challenge exacerbated by the difficulty of balancing accent-specific adaptation with generalization. The authors propose MoE-CTC, a novel architecture that integrates a mixture-of-experts model with intermediate CTC supervision. During training, an accent-aware routing mechanism encourages expert specialization, while each expert is equipped with a dedicated CTC head to keep routing aligned with transcription quality. At inference, the system switches to label-free routing to improve adaptability to unseen accents. The approach combines dynamic routing with intermediate CTC supervision and introduces a routing-augmented loss function. Evaluated on the MCV-Accent benchmark, MoE-CTC consistently outperforms the FastConformer baseline across both high- and low-resource settings, achieving up to a 29.3% relative reduction in word error rate.
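The summary above describes the mechanism without implementation detail. The following is a minimal sketch, assuming a PyTorch-style encoder block, of how per-expert intermediate CTC heads and accent-aware routing could fit together; the class name MoECTCBlock, the utterance-level routing, and all layer sizes are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' implementation) of a mixture-of-experts block
# with per-expert intermediate CTC heads and an accent-aware routing term.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoECTCBlock(nn.Module):
    def __init__(self, d_model: int, num_experts: int, vocab_size: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # routing logits per utterance
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                          nn.Linear(d_model, d_model))
            for _ in range(num_experts)
        ])
        # one intermediate CTC head per expert ties routing to transcription quality
        self.ctc_heads = nn.ModuleList([
            nn.Linear(d_model, vocab_size) for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor, accent_ids=None):
        # x: (batch, time, d_model); accent_ids: (batch,) accent labels or None
        route_logits = self.router(x.mean(dim=1))      # utterance-level routing logits
        weights = route_logits.softmax(dim=-1)         # (batch, num_experts)

        if accent_ids is not None:                     # accent-aware routing (training)
            route_loss = F.cross_entropy(route_logits, accent_ids)
        else:                                          # label-free routing (inference)
            route_loss = route_logits.new_zeros(())

        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, T, D)
        mixed = (weights[:, :, None, None] * expert_outs).sum(dim=1)    # (B, T, D)

        # per-expert intermediate CTC log-probabilities for auxiliary CTC losses
        ctc_log_probs = [
            head(out).log_softmax(dim=-1)
            for head, out in zip(self.ctc_heads, expert_outs.unbind(dim=1))
        ]
        return mixed, ctc_log_probs, weights, route_loss
```

A full model would presumably stack such blocks inside a FastConformer-style encoder and combine the per-expert CTC losses and routing loss with the main CTC objective, in line with the routing-augmented loss mentioned above.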

📝 Abstract
Accented speech remains a persistent challenge for automatic speech recognition (ASR), as most models are trained on data dominated by a few high-resource English varieties, leading to substantial performance degradation for other accents. Accent-agnostic approaches improve robustness yet struggle with heavily accented or unseen varieties, while accent-specific methods rely on limited and often noisy labels. We introduce MoE-CTC, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. During training, accent-aware routing encourages experts to capture accent-specific patterns, and it gradually transitions to label-free routing used at inference. Each expert is equipped with its own CTC head to align routing with transcription quality, and a routing-augmented loss further stabilizes optimization. Experiments on the MCV-Accent benchmark demonstrate consistent gains across both seen and unseen accents in low- and high-resource conditions, achieving up to 29.3% relative WER reduction over strong FastConformer baselines.
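The abstract does not spell out the objective; as a hedged guess at its general shape, a routing-augmented loss that ties the intermediate CTC supervision to the routing weights might read

$$
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CTC}} \;+\; \lambda \sum_{e=1}^{E} w_e\, \mathcal{L}_{\mathrm{CTC}}^{(e)} \;+\; \mu\, \mathcal{L}_{\mathrm{route}},
$$

where $\mathcal{L}_{\mathrm{CTC}}$ is the main CTC loss, $\mathcal{L}_{\mathrm{CTC}}^{(e)}$ the intermediate CTC loss of expert $e$, $w_e$ its routing weight, $\mathcal{L}_{\mathrm{route}}$ the accent-aware routing loss, and $\lambda, \mu$ assumed mixing coefficients not given in the abstract.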
Problem

Research questions and friction points this paper is trying to address.

accented speech recognition
automatic speech recognition
accent robustness
speech recognition generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Intermediate CTC Supervision
Accent-Robust ASR
Expert Specialization
Routing-Augmented Loss
🔎 Similar Papers
No similar papers found.
Wonjun Lee
POSTECH
Speech Recognition · ASR · NLP · Deep Learning · LLM
Hyounghun Kim
POSTECH
NLP · Multimodal Learning
Gary Geunbae Lee
Department of Computer Science and Engineering, POSTECH; Graduate School of Artificial Intelligence, POSTECH