Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling

📅 2026-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets two weaknesses of LLM-based automatic speech recognition (ASR): limited generalization across languages and the difficulty of aligning the speech and text modalities. It proposes a projector-based LLM-ASR framework that combines a Mixture-of-Experts (MoE) architecture with the Continuous Integrate-and-Fire (CIF) mechanism. The MoE component is intended to improve cross-lingual adaptability, while CIF performs dynamic downsampling and aligns speech representations with text. The authors report that combining the two components yields substantial improvements over strong baseline models.
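To make the dynamic downsampling idea concrete, here is a minimal sketch of the general CIF technique in plain Python. It is not the paper's implementation: the function name `cif`, the arguments `frames`, `alphas`, and `threshold`, and the choice to discard any leftover weight at the end of the utterance are all assumptions made for illustration. In a real system the per-frame weights would be predicted by a small network on top of the speech encoder.

```python
from typing import List

Vector = List[float]


def cif(frames: List[Vector], alphas: List[float], threshold: float = 1.0) -> List[Vector]:
    """Illustrative Continuous Integrate-and-Fire over a sequence of frames.

    Each frame contributes its weight alpha to an accumulator. When the
    accumulator reaches `threshold`, one token-level vector is emitted
    ("fired") and the unused part of the current frame's weight is carried
    over to the next token. Assumes 0 <= alpha <= threshold, so a single
    frame fires at most once (an assumption of this sketch).
    """
    dim = len(frames[0]) if frames else 0
    outputs: List[Vector] = []
    acc_weight = 0.0
    acc_vec = [0.0] * dim

    for h, a in zip(frames, alphas):
        if acc_weight + a < threshold:
            # Integrate: keep accumulating weight and weighted features.
            acc_weight += a
            acc_vec = [v + a * x for v, x in zip(acc_vec, h)]
        else:
            # Fire: use only the weight needed to reach the threshold.
            used = threshold - acc_weight
            outputs.append([v + used * x for v, x in zip(acc_vec, h)])
            # The remainder of this frame's weight starts the next token.
            rest = a - used
            acc_weight = rest
            acc_vec = [rest * x for x in h]

    # Leftover weight below the threshold is dropped in this sketch.
    return outputs


# Five 1-D frames are compressed to two token-level vectors.
tokens = cif([[1.0], [3.0], [2.0], [4.0], [8.0]], [0.5, 0.5, 0.25, 0.75, 0.5])
print(len(tokens))
```

The number of emitted vectors depends on the sum of the weights rather than on a fixed stride, which is what makes the downsampling rate dynamic.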
📝 Abstract
The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.
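As a rough illustration of the MoE idea mentioned above, the following sketch shows generic top-k expert routing in plain Python. The abstract does not say where the experts sit or how routing is done, so everything here is an assumption: the names `moe_forward`, `router_weights`, `experts`, and `top_k` are hypothetical, and the linear router with softmax gating is the common textbook form rather than the authors' design.

```python
import math
from typing import Callable, List

Vector = List[float]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: Vector,
    router_weights: List[Vector],
    experts: List[Callable[[Vector], Vector]],
    top_k: int = 2,
) -> Vector:
    """Route one input vector to its top-k experts and mix their outputs.

    `router_weights` holds one row per expert (a linear router, assumed
    here). Gate values of the selected experts are renormalized to sum to 1.
    """
    # Router score per expert: dot product of its row with the input.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    probs = softmax(logits)

    # Keep the k experts with the highest routing probability.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)

    # Weighted sum of the selected experts' outputs.
    out: Vector = []
    for i in chosen:
        y = experts[i](x)
        gate = probs[i] / norm
        out = [gate * yi for yi in y] if not out else [o + gate * yi for o, yi in zip(out, y)]
    return out


# Three toy experts; in a multilingual model each might specialize by language.
toy_experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [0.0 for _ in v],
]
toy_router = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
print(moe_forward([2.0, 2.0], toy_router, toy_experts, top_k=2))
```

Only the selected experts are evaluated for a given input, so capacity can grow with the number of experts while per-input compute stays roughly constant.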
Problem

Research questions and friction points this paper is trying to address.

Multilingual ASR
Large Language Models
Modality Alignment
Cross-lingual Generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Experts
Dynamic Downsampling
Modality Alignment
Multilingual ASR
LLM-based Speech Recognition