🤖 AI Summary
This work addresses two obstacles to deploying large-scale multilingual speech recognition models on edge devices: their high computational cost and their reliance on explicit language identifiers. The authors propose a lightweight, end-to-end CTC architecture that integrates an mHuBERT backbone with a hierarchical LoRA-MoE module. A dynamic routing mechanism driven by language identification (LID) posteriors enables language-agnostic, single-pass decoding without prior language labels. By adaptively fusing expert modules, the approach balances shared and language-specific representations. On the MSR-86K and MLC-SLM 2025 Challenge datasets, the method performs competitively with state-of-the-art two-stage systems while improving inference efficiency and reducing resource requirements.
📝 Abstract
Large-scale multilingual ASR (mASR) models such as Whisper achieve strong performance but incur high computational and latency costs, limiting their deployment on resource-constrained edge devices. In this study, we propose a lightweight and language-agnostic multilingual ASR system based on a CTC architecture with domain adaptation. Specifically, we introduce a Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework integrated into an mHuBERT-CTC model, enabling end-to-end decoding via LID-posterior-driven LoRA routing. The hierarchical design consists of a multilingual shared LoRA, which learns language-invariant acoustic representations, and language-specific LoRA experts, which model language-dependent characteristics. The proposed routing mechanism removes the need for prior language identity information or explicit language labels during inference, making decoding truly language-agnostic. Experiments on the MSR-86K and MLC-SLM 2025 Challenge datasets demonstrate that HLoRA performs competitively with state-of-the-art two-stage inference methods while using only single-pass decoding, significantly improving decoding efficiency for low-resource mASR applications.
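The abstract does not state the routing formula, so the following is only a minimal sketch of one plausible reading of LID-posterior-driven LoRA routing. It assumes a frozen base projection `W`, one shared low-rank update that is always applied, and per-language low-rank updates weighted by a softmax over LID logits. The function names, the additive combination, and the toy shapes are all hypothetical and are not taken from the authors' implementation.

```python
import math

def softmax(logits):
    """Turn LID logits into a posterior over languages."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in M]

def lora_delta(A, B, x):
    """Low-rank update B @ (A @ x), with A: r x d_in and B: d_out x r."""
    return matvec(B, matvec(A, x))

def hlora_layer(x, W, shared, experts, lid_logits):
    """Assumed form: y = W x + B_s A_s x + sum_k p_k * B_k A_k x.

    shared  -- (A_s, B_s), the multilingual shared LoRA
    experts -- list of (A_k, B_k), one pair per language
    p_k     -- softmax(lid_logits)[k], so no language label is needed
    """
    posterior = softmax(lid_logits)
    y = matvec(W, x)
    shared_update = lora_delta(shared[0], shared[1], x)
    y = [a + b for a, b in zip(y, shared_update)]
    for p, (A, B) in zip(posterior, experts):
        expert_update = lora_delta(A, B, x)
        y = [a + p * b for a, b in zip(y, expert_update)]
    return y, posterior

# Toy example: 2-dim features, rank-1 adapters, two languages.
W = [[1.0, 0.0], [0.0, 1.0]]                # frozen base weight (identity)
shared = ([[1.0, 0.0]], [[0.0], [0.0]])     # shared LoRA with zero-initialised B
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),         # hypothetical expert for language 0
    ([[0.0, 1.0]], [[0.0], [1.0]]),         # hypothetical expert for language 1
]
y, posterior = hlora_layer([1.0, 2.0], W, shared, experts, lid_logits=[0.0, 0.0])
```

Under this reading, a confident LID prediction makes the layer behave like a single language-specific adapter, while an uncertain one blends several experts, all within the same forward pass.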