A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR

📅 2026-01-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of deploying large-scale multilingual speech recognition models on edge devices, which typically suffer from high computational costs and reliance on explicit language identifiers. The authors propose a lightweight, end-to-end CTC architecture that integrates an mHuBERT backbone with a hierarchical LoRA-MoE module. Language-agnostic single-pass decoding is achieved through a language identification (LID) posterior-driven dynamic routing mechanism, eliminating the need for prior language labels. This approach effectively balances shared and language-specific representations by adaptively fusing expert modules. Evaluated on the MSR-86K and MLC-SLM 2025 Challenge datasets, the method matches the performance of state-of-the-art two-stage systems while significantly improving inference efficiency and reducing resource requirements.

📝 Abstract
Large-scale multilingual ASR (mASR) models such as Whisper achieve strong performance but incur high computational and latency costs, limiting their deployment on resource-constrained edge devices. In this study, we propose a lightweight and language-agnostic multilingual ASR system based on a CTC architecture with domain adaptation. Specifically, we introduce a Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework integrated into an mHuBERT-CTC model, enabling end-to-end decoding via LID-posterior-driven LoRA routing. The hierarchical design consists of a multilingual shared LoRA for learning language-invariant acoustic representations and language-specific LoRA experts for modeling language-dependent characteristics. The proposed routing mechanism removes the need for prior language identity information or explicit language labels during inference, achieving true language-agnostic decoding. Experiments on MSR-86K and the MLC-SLM 2025 Challenge datasets demonstrate that HLoRA achieves competitive performance with state-of-the-art two-stage inference methods using only single-pass decoding, significantly improving decoding efficiency for low-resource mASR applications.
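The fusion described in the abstract (a shared LoRA plus language-specific LoRA experts, weighted by the LID posterior so no language label is needed at inference) can be sketched as follows. This is a minimal illustration, not the authors' implementation: all names, shapes, and the utterance-level (rather than frame-level) posterior are assumptions, and the toy dimensions are chosen only for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

def lora(h, A, B):
    # Low-rank update h @ A @ B, with A of shape (D, r) and B of shape (r, D).
    return h @ A @ B

def hlora_moe(h, shared, experts, lid_posterior):
    """Hedged sketch of hierarchical LoRA-MoE fusion: a multilingual shared
    LoRA plus language-specific LoRA experts, adaptively combined using the
    LID posterior as routing weights (no explicit language label needed)."""
    delta = lora(h, *shared)                      # language-invariant part
    for w, (A, B) in zip(lid_posterior, experts): # language-dependent parts
        delta += w * lora(h, A, B)
    return h + delta                              # residual LoRA update

D, r, L = 8, 2, 3  # toy sizes: model dim, LoRA rank, number of languages
make = lambda: (rng.standard_normal((D, r)) * 0.01,
                rng.standard_normal((r, D)) * 0.01)
shared = make()
experts = [make() for _ in range(L)]

h = rng.standard_normal((5, D))                   # 5 frames of encoder features
logits = rng.standard_normal(L)
posterior = np.exp(logits) / np.exp(logits).sum() # softmax LID posterior

out = hlora_moe(h, shared, experts, posterior)
print(out.shape)  # (5, 8)
```

Because the posterior weights the experts softly, uncertain LID predictions fall back toward the shared representation rather than committing to a single wrong language, which is what enables single-pass decoding without a prior language identity.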
Problem

Research questions and friction points this paper is trying to address.

multilingual ASR, computational efficiency, language-agnostic decoding, edge deployment, latency reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-Agnostic, Hierarchical LoRA-MoE, CTC-based ASR, LID-posterior-driven Routing, Multilingual Speech Recognition
Yu Zheng (Shanghai Normal University, Shanghai, 200234, China)
Yuxiang Mei (Shanghai Normal University, Shanghai, 200234, China)
Dongxing Xu (Unisound AI Technology Co., Ltd., Beijing, China)
Jie Chen (Shanghai Normal University, Shanghai, 200234, China)
Yanhua Long (Professor, Shanghai Normal University; Speech signal processing)