SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge

📅 2025-07-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
For the INTERSPEECH 2025 MLC-SLM Challenge Track 1 task—multilingual conversational ASR without additional training data—we propose a speech-language joint modeling framework. Our method employs a parallel dual-encoder architecture comprising Whisper-large-v3 and mHuBERT-147, integrated with a large language model (LLM) to enable cross-lingual semantic alignment. We introduce a language-aware prompting mechanism and design a three-stage low-rank fine-tuning strategy coupled with a learnable projection layer to enhance parameter efficiency and language adaptability. On the blind test set, our system achieves a combined CER/WER of 11.76%, representing an absolute reduction of 8.41 percentage points over the official baseline. This result validates the effectiveness and generalizability of joint speech-language modeling with lightweight adaptation in multilingual ASR scenarios.

📝 Abstract
This paper describes the SHNU multilingual conversational speech recognition system (SHNU-mASR, team name "maybe"), submitted to Track 1 of the INTERSPEECH 2025 MLC-SLM Challenge. Our system integrates a parallel-speech-encoder architecture with a large language model (LLM) to form a unified multilingual ASR framework. The parallel speech encoder consists of two pre-trained encoders: the Whisper-large-v3 encoder and the mHuBERT-147 encoder. Their output embeddings are concatenated and fed into the LLM, enabling the model to leverage complementary acoustic and linguistic knowledge and achieve competitive performance. Moreover, we adopt a tri-stage training strategy to jointly update the low-rank adaptation modules and projector parameters of both the speech encoders and the LLM. In addition, we incorporate a language-aware prompt at the LLM input to enhance language-specific text generation. The SHNU-mASR system achieves an overall character/word error rate (CER/WER) of 11.76% on the blind evaluation set of the challenge, outperforming the official MLC-SLM baseline by 8.41 absolute CER/WER points, without increasing the baseline training data.
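The fusion described in the abstract can be sketched as follows: frame-level embeddings from the two encoders are concatenated along the feature dimension and passed through a learnable projection into the LLM's input space. This is a minimal NumPy sketch; the frame count, projection initialization, and LLM width (4096) are illustrative assumptions, not values from the paper, though the 1280-dim Whisper-large-v3 and 768-dim mHuBERT-147 hidden sizes match the public model configurations.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                        # assumed number of (downsampled) speech frames
D_WHISPER, D_MHUBERT = 1280, 768
D_LLM = 4096                  # assumed LLM embedding width

# Stand-ins for the two pre-trained encoders' outputs over the same utterance
whisper_emb = rng.standard_normal((T, D_WHISPER))   # Whisper-large-v3 encoder
mhubert_emb = rng.standard_normal((T, D_MHUBERT))   # mHuBERT-147 encoder

# Frame-wise concatenation of the parallel encoders' embeddings
fused = np.concatenate([whisper_emb, mhubert_emb], axis=-1)   # (T, 2048)

# Learnable projection layer mapping fused features into the LLM input space
W_proj = rng.standard_normal((D_WHISPER + D_MHUBERT, D_LLM)) * 0.01
llm_speech_tokens = fused @ W_proj                            # (T, 4096)

print(llm_speech_tokens.shape)
```

In a real system the projector (and the LoRA modules) would be trained, and the projected sequence would be prepended to the text-prompt embeddings before being fed to the LLM.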
Problem

Research questions and friction points this paper is trying to address.

Develop a multilingual conversational speech recognition system
Integrate a parallel speech encoder with a large language model
Improve accuracy over the baseline without extra training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel speech encoder combining Whisper-large-v3 and mHuBERT-147
Tri-stage training strategy for low-rank adaptation
Language-aware prompt enhances language-specific text generation
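The language-aware prompt listed above conditions the LLM on the target language before decoding. A minimal sketch of how such a prompt might be assembled; the tag syntax and wording are assumptions for illustration, not the paper's actual template.

```python
def build_language_aware_prompt(language: str) -> str:
    """Prepend an (assumed) language tag and a language-specific
    transcription instruction to steer the LLM's text generation."""
    return (
        f"<|lang:{language}|> "
        f"Transcribe the following {language} conversational speech "
        f"into {language} text:"
    )

print(build_language_aware_prompt("German"))
```

At inference time, the embeddings of this prompt would be concatenated with the projected speech tokens to form the LLM input.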
Yuxiang Mei
Shanghai Normal University, Shanghai, China
Yuang Zheng
Shanghai Normal University, Shanghai, China
Dongxing Xu
Unisound AI Technology Co., Ltd., Beijing, China
Yanhua Long
Professor, Shanghai Normal University
Speech signal processing