Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems in LLM-based multilingual automatic speech recognition (ASR): limited generalization across languages and the difficulty of aligning the speech and text modalities. It proposes a projector-based LLM-ASR framework that combines a Mixture-of-Experts (MoE) architecture with the Continuous Integrate-and-Fire (CIF) mechanism. The MoE component improves cross-lingual adaptability, while CIF performs dynamic downsampling and alignment between speech and text. The combined system improves ASR accuracy and robustness across multiple languages and consistently outperforms strong baseline systems on several benchmarks.
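To make the MoE idea concrete, here is a minimal sketch of a routed projection layer in plain Python. It is not the paper's implementation: the summary does not say where the experts sit or how routing is done, so the softmax router, the top-k selection, the linear experts, and every name (`moe_project`, `router`, `experts`, `top_k`) are assumptions chosen for illustration.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(z: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def moe_project(x: Vector, router: Matrix, experts: List[Matrix],
                top_k: int = 1) -> Vector:
    """Route one speech frame x through the top_k highest-scoring experts.

    Assumption: each expert is a single linear map and the router is a
    linear layer followed by softmax. Real systems often add biases,
    non-linearities, and a load-balancing loss.
    """
    gates = softmax(matvec(router, x))
    # Indices of the top_k experts by gate probability.
    chosen = sorted(range(len(experts)), key=lambda i: gates[i],
                    reverse=True)[:top_k]
    # Renormalize the selected gates so they sum to 1.
    norm = sum(gates[i] for i in chosen)
    out = [0.0] * len(experts[0])
    for i in chosen:
        y = matvec(experts[i], x)
        w = gates[i] / norm
        out = [o + w * v for o, v in zip(out, y)]
    return out


# Two toy experts: one doubles the input, one negates it.
router = [[5.0, 0.0], [0.0, 5.0]]
experts = [[[2.0, 0.0], [0.0, 2.0]], [[-1.0, 0.0], [0.0, -1.0]]]
result_a = moe_project([1.0, 0.0], router, experts, top_k=1)
result_b = moe_project([0.0, 1.0], router, experts, top_k=1)
```

In this toy setup the router sends the first input to the doubling expert and the second to the negating expert. This mirrors the intuition that inputs from different languages can be handled by different specialized parameters.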

📝 Abstract

The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.
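The dynamic downsampling performed by CIF can be illustrated with a short sketch. It follows the general integrate-and-fire recipe: accumulate a per-frame weight, and when the running sum reaches a threshold, emit one integrated vector and carry the leftover weight into the next one. The function name `cif`, the threshold of 1.0, and the assumption that every weight is smaller than the threshold are illustrative choices, not details taken from the paper.

```python
from typing import List


def cif(frames: List[List[float]], alphas: List[float],
        threshold: float = 1.0) -> List[List[float]]:
    """Integrate weighted frames and fire one output per unit of weight.

    Assumption: each alpha lies in [0, threshold), so a single frame can
    trigger at most one firing. Weight left in the accumulator at the end
    of the sequence is discarded in this sketch.
    """
    dim = len(frames[0]) if frames else 0
    outputs: List[List[float]] = []
    acc_weight = 0.0
    acc_state = [0.0] * dim
    for frame, alpha in zip(frames, alphas):
        if acc_weight + alpha < threshold:
            # Keep integrating: no token boundary inside this frame.
            acc_weight += alpha
            acc_state = [s + alpha * x for s, x in zip(acc_state, frame)]
        else:
            # Boundary reached: use only the weight needed to hit the
            # threshold, fire, and carry the remainder forward.
            used = threshold - acc_weight
            outputs.append([s + used * x for s, x in zip(acc_state, frame)])
            remainder = alpha - used
            acc_weight = remainder
            acc_state = [remainder * x for x in frame]
    return outputs


frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
alphas = [0.5, 0.75, 0.5, 0.25]
tokens = cif(frames, alphas)  # 4 frames -> 2 integrated vectors
```

Because the number of firings depends on the predicted weights rather than on a fixed stride, the output length adapts to the content of the utterance. That is the sense in which the downsampling is dynamic.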

Problem

Research questions and friction points this paper is trying to address.

multilingual ASR

large language models

modality alignment

cross-lingual generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Experts

Dynamic Downsampling

Modality Alignment

Multilingual ASR