Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR

📅 2026-01-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the significant performance gap between Speech-LLMs and end-to-end models in multilingual conversational automatic speech recognition (ASR), as well as the limitations of simple feature concatenation in effectively fusing multimodal speech information. To this end, the authors propose an enhanced Speech-LLM framework that integrates a fine-tuned Whisper encoder and an mHuBERT encoder in parallel, coupled with a cross-attention mechanism to enable deep fusion of speech representations. The study presents the first systematic comparison between Speech-LLMs and end-to-end ASR architectures on the Multilingual Conversational Spoken Language Modeling (MLC-SLM) task under identical data conditions, offering empirical insights for future Speech-LLM design. Using only 1,500 hours of training data, the proposed system achieves a competitive 10.69% CER/WER on the official evaluation set, matching the performance of top-tier Track 1 systems trained on substantially larger datasets.

📝 Abstract
The INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM) promotes multilingual conversational ASR with large language models (LLMs). Our previous SHNU-mASR system adopted a competitive parallel-speech-encoder architecture that integrated Whisper and mHuBERT with an LLM. However, it faced two challenges: simple feature concatenation may not fully exploit complementary information, and the performance gap between LLM-based ASR and end-to-end (E2E) encoder-decoder ASR remained unexplored. In this work, we present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. We first evaluate E2E Whisper models with LoRA and full fine-tuning on the MLC-SLM ASR task, and then propose cross-attention-based fusion mechanisms for the parallel-speech-encoder. On the official evaluation set of the MLC-SLM Challenge, our system achieves a CER/WER of 10.69%, ranking on par with the top-ranked Track 1 systems, even though it uses only 1,500 hours of baseline training data compared with their large-scale training sets. Nonetheless, we find that our final LLM-based ASR still does not match the performance of a fine-tuned E2E Whisper model, providing valuable empirical guidance for future Speech-LLM design. Our code is publicly available at https://github.com/1535176727/MLC-SLM.
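The fusion idea the abstract describes can be illustrated with a minimal single-head sketch: one encoder stream (e.g. Whisper frames) forms the queries, the other (e.g. mHuBERT frames) forms the keys and values, so each Whisper frame gathers complementary information from the whole mHuBERT sequence instead of being concatenated frame-by-frame. This is a toy sketch, not the paper's implementation; the dimensions, one-hot toy features, and lack of learned projections are illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head, pure Python).

    queries: frames from one encoder stream (e.g. Whisper)
    keys/values: frames from the other stream (e.g. mHuBERT)
    Returns one fused vector per query frame.
    """
    d = len(keys[0])
    # attention scores: one row per query frame, one column per key frame
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d)
               for k_row in keys] for q_row in queries]
    weights = [softmax(row) for row in scores]
    # each fused frame is an attention-weighted sum of the value frames
    return [[sum(w * values[j][c] for j, w in enumerate(w_row))
             for c in range(len(values[0]))] for w_row in weights]

# Toy example: 2 "Whisper" frames attend over 3 "mHuBERT" frames (dim 4)
whisper = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2]]
hubert = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
fused = cross_attention(whisper, hubert, hubert)
print(len(fused), len(fused[0]))  # 2 4
```

Note that the output length follows the query stream, which sidesteps the frame-rate mismatch between the two encoders that simple concatenation has to work around.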
Problem

Research questions and friction points this paper is trying to address.

multilingual conversational ASR
Speech-LLM
end-to-end ASR
performance gap
feature fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech-LLM
cross-attention fusion
multilingual conversational ASR
end-to-end ASR
Whisper fine-tuning
Yuxiang Mei
Shanghai Normal University, Shanghai, China
Dongxing Xu
Unisound AI Technology Co., Ltd., Beijing, China
Jiaen Liang
Unisound AI Technology Co., Ltd., Beijing, China
Yanhua Long
Professor, Shanghai Normal University
Speech signal processing