🤖 AI Summary
This work investigates the significant performance gap between Speech-LLMs and end-to-end models in multilingual conversational automatic speech recognition (ASR), as well as the limitations of simple feature concatenation in fusing multimodal speech information. To this end, the authors propose an enhanced Speech-LLM framework that integrates a fine-tuned Whisper encoder and an mHuBERT encoder in parallel, coupled with a cross-attention mechanism for deep fusion of speech representations. The study presents the first systematic comparison between Speech-LLMs and end-to-end ASR architectures on the Multilingual Conversational Spoken Language Modeling (MLC-SLM) task under identical data conditions, offering empirical insights for future Speech-LLM design. Using only 1,500 hours of training data, the proposed system achieves a competitive 10.69% CER/WER on the official evaluation set, matching the performance of top-ranked Track 1 systems trained on substantially larger datasets.
📝 Abstract
The INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM) promotes multilingual conversational ASR with large language models (LLMs). Our previous SHNU-mASR system adopted a competitive parallel-speech-encoder architecture that integrated Whisper and mHuBERT with an LLM. However, it faced two challenges: simple feature concatenation may not fully exploit complementary information, and the performance gap between LLM-based ASR and end-to-end (E2E) encoder-decoder ASR remained unexplored. In this work, we present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. We first evaluate E2E Whisper models with LoRA and full fine-tuning on the MLC-SLM ASR task, and then propose cross-attention-based fusion mechanisms for the parallel-speech-encoder. On the official evaluation set of the MLC-SLM Challenge, our system achieves a CER/WER of 10.69%, on par with the top-ranked Track 1 systems, even though it uses only the 1,500 hours of baseline training data rather than their large-scale training sets. Nonetheless, we find that our final LLM-based ASR still does not match the performance of a fine-tuned E2E Whisper model, providing valuable empirical guidance for future Speech-LLM design. Our code is publicly available at https://github.com/1535176727/MLC-SLM.
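To make the cross-attention fusion idea concrete, the sketch below shows one plausible way two parallel encoder streams could be fused before being fed to an LLM: Whisper frames act as queries attending over mHuBERT frames, with a residual connection and a projection into the LLM embedding space. All dimensions, the single attention layer, and the query/key-value assignment are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch of parallel-encoder fusion via cross-attention.

    Whisper features serve as queries; mHuBERT features (projected to the
    same width) serve as keys/values. The attended output is added back to
    the query stream and mapped to the LLM embedding dimension.
    Dimensions below are assumed for illustration only.
    """

    def __init__(self, d_whisper=1280, d_mhubert=768, d_llm=2048, n_heads=8):
        super().__init__()
        self.kv_proj = nn.Linear(d_mhubert, d_whisper)  # align mHuBERT width to Whisper
        self.attn = nn.MultiheadAttention(d_whisper, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_whisper)
        self.out_proj = nn.Linear(d_whisper, d_llm)     # project fused frames into LLM space

    def forward(self, whisper_feats, mhubert_feats):
        # whisper_feats: (B, T_w, d_whisper); mhubert_feats: (B, T_m, d_mhubert)
        kv = self.kv_proj(mhubert_feats)
        attended, _ = self.attn(query=whisper_feats, key=kv, value=kv)
        fused = self.norm(whisper_feats + attended)     # residual over the query stream
        return self.out_proj(fused)                     # (B, T_w, d_llm)

# Toy shapes: batch of 2, 50 Whisper frames, 100 mHuBERT frames.
fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 50, 1280), torch.randn(2, 100, 768))
print(tuple(out.shape))  # (2, 50, 2048)
```

Note that the fused sequence keeps the query stream's length (here, the Whisper frame count), so the two encoders do not need matching frame rates; the alternative of concatenating features along the channel axis, as in the earlier SHNU-mASR system, would require aligning them first.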