🤖 AI Summary
This work addresses two problems in applying large language models (LLMs) to multilingual automatic speech recognition (ASR): limited generalization across languages and the difficulty of aligning the speech and text modalities. It proposes a projector-based LLM-ASR framework that combines a Mixture-of-Experts (MoE) architecture with the Continuous Integrate-and-Fire (CIF) mechanism. The MoE component improves cross-lingual adaptability, while CIF dynamically downsamples the speech representation and aligns it with the text modality. The reported experiments show that combining the two components yields substantial gains in recognition performance over strong baseline models.
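The MoE idea described above can be illustrated with a minimal sketch of a top-k gated projector. This is not the authors' implementation: the function name `moe_projector`, the linear experts, the top-k softmax gating, and all shapes are assumptions chosen only to show how a router can send each speech frame to a few experts and mix their outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_projector(x, gate_w, expert_ws, top_k=2):
    """Hypothetical MoE projector (illustrative only).

    x:         (T, D_in)        speech-encoder frames
    gate_w:    (D_in, E)        router weights
    expert_ws: (E, D_in, D_out) one linear expert per slot
    returns:   (T, D_out)       projected frames
    """
    logits = x @ gate_w                                   # (T, E) router scores
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]     # (T, k) chosen experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    weights = softmax(top_logits, axis=-1)                # (T, k), rows sum to 1
    out = np.zeros((x.shape[0], expert_ws.shape[2]))
    for k in range(top_k):
        idx = top_idx[:, k]                               # expert id per frame
        # Apply each frame's selected expert: (T, D_in) x (T, D_in, D_out).
        expert_out = np.einsum("td,tde->te", x, expert_ws[idx])
        out += weights[:, k:k + 1] * expert_out
    return out
```

In this sketch each frame is routed independently, so frames from different languages can, in principle, end up using different experts; whether the paper routes per frame, per utterance, or per language is not stated in the abstract.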
📝 Abstract
The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.
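To make the dynamic downsampling role of CIF concrete, the following is a minimal sketch of the integrate-and-fire step in its commonly described form, assuming a firing threshold of 1.0 and per-frame weights in [0, 1]. The function name `cif` and the toy inputs are hypothetical, and the abstract does not specify how the weights are predicted or how the mechanism is wired into the projector, so this should be read as an illustration rather than the paper's method.

```python
import numpy as np

def cif(frames, alphas, threshold=1.0):
    """Illustrative Continuous Integrate-and-Fire step.

    frames: (T, D) encoder outputs
    alphas: (T,)   per-frame weights, assumed to lie in [0, 1]
    returns (N, D) integrated embeddings, one per firing (N <= T)
    """
    outputs = []
    acc_weight = 0.0                          # weight accumulated so far
    acc_state = np.zeros(frames.shape[1])     # weighted sum of frames so far
    for h, a in zip(frames, alphas):
        if acc_weight + a < threshold:
            # Keep integrating: no token boundary yet.
            acc_weight += a
            acc_state = acc_state + a * h
        else:
            # Fire: spend just enough of this frame's weight to reach the
            # threshold, emit the embedding, and carry over the remainder.
            used = threshold - acc_weight
            outputs.append(acc_state + used * h)
            rest = a - used
            acc_weight = rest
            acc_state = rest * h
    if not outputs:
        return np.zeros((0, frames.shape[1]))
    return np.stack(outputs)

# Toy usage: five frames are compressed into two token-level embeddings.
frames = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 4.0], [0.0, 8.0], [2.0, 2.0]])
alphas = np.array([0.5, 0.5, 0.25, 0.75, 0.5])
tokens = cif(frames, alphas)
```

The number of emitted embeddings depends on the accumulated weights rather than on a fixed stride, which is what makes the downsampling dynamic. Any weight left over at the end of the sequence is simply dropped in this sketch; practical systems usually handle that tail explicitly.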