🤖 AI Summary
This work investigates the significant performance gap between Speech-LLMs and end-to-end models in multilingual conversational automatic speech recognition (ASR), as well as the limitations of simple feature concatenation in fusing multimodal speech information. To this end, the authors propose an enhanced Speech-LLM framework that integrates a fine-tuned Whisper encoder and an mHuBERT encoder in parallel, coupled with a cross-attention mechanism for deep fusion of speech representations. The study presents the first systematic comparison between Speech-LLMs and end-to-end ASR architectures on the Multilingual Conversational Spoken Language Modeling (MLC-SLM) task under identical data conditions, offering empirical insights for future Speech-LLM design. Using only 1,500 hours of training data, the proposed system achieves a competitive 10.69% CER/WER on the official evaluation set, matching the performance of top-ranked Track 1 systems trained on substantially larger datasets.
📝 Abstract
The INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM) promotes multilingual conversational ASR with large language models (LLMs). Our previous SHNU-mASR system adopted a competitive parallel-speech-encoder architecture that integrated Whisper and mHuBERT with an LLM. However, it faced two challenges: simple feature concatenation may not fully exploit complementary information, and the performance gap between LLM-based ASR and end-to-end (E2E) encoder-decoder ASR remained unexplored. In this work, we present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. We first evaluate E2E Whisper models with LoRA and full fine-tuning on the MLC-SLM ASR task, and then propose cross-attention-based fusion mechanisms for the parallel-speech-encoder. On the official evaluation set of the MLC-SLM Challenge, our system achieves a CER/WER of 10.69%, on par with the top-ranked Track 1 systems, even though it uses only the 1,500 hours of baseline training data rather than their large-scale training sets. Nonetheless, we find that our final LLM-based ASR still does not match the performance of a fine-tuned E2E Whisper model, providing valuable empirical guidance for future Speech-LLM design. Our code is publicly available at https://github.com/1535176727/MLC-SLM.
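To make the cross-attention fusion idea concrete, the sketch below shows one plausible way two parallel encoder streams could be fused before being fed to an LLM: Whisper frames act as queries attending over mHuBERT frames, with a residual connection and a projection into the LLM embedding space. All dimensions, the single attention layer, and the query/key-value assignment are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch of parallel-encoder fusion via cross-attention.

    Whisper features serve as queries; mHuBERT features (projected to the
    same width) serve as keys/values. The attended output is added back to
    the query stream and mapped to the LLM embedding dimension.
    Dimensions below are assumed for illustration only.
    """

    def __init__(self, d_whisper=1280, d_mhubert=768, d_llm=2048, n_heads=8):
        super().__init__()
        self.kv_proj = nn.Linear(d_mhubert, d_whisper)  # align mHuBERT width to Whisper
        self.attn = nn.MultiheadAttention(d_whisper, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_whisper)
        self.out_proj = nn.Linear(d_whisper, d_llm)     # project fused frames into LLM space

    def forward(self, whisper_feats, mhubert_feats):
        # whisper_feats: (B, T_w, d_whisper); mhubert_feats: (B, T_m, d_mhubert)
        kv = self.kv_proj(mhubert_feats)
        attended, _ = self.attn(query=whisper_feats, key=kv, value=kv)
        fused = self.norm(whisper_feats + attended)     # residual over the query stream
        return self.out_proj(fused)                     # (B, T_w, d_llm)

# Toy shapes: batch of 2, 50 Whisper frames, 100 mHuBERT frames.
fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 50, 1280), torch.randn(2, 100, 768))
print(tuple(out.shape))  # (2, 50, 2048)
```

Note that the fused sequence keeps the query stream's length (here, the Whisper frame count), so the two encoders do not need matching frame rates; the alternative of concatenating features along the channel axis, as in the earlier SHNU-mASR system, would require aligning them first.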