The DKU System for Multi-Speaker Automatic Speech Recognition in MLC-SLM Challenge

📅 2025-07-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenging task of speaker-unlabeled, time-boundary-free multi-speaker automatic speech recognition (MS-ASR). We propose an end-to-end ASR framework that integrates speaker embeddings and temporal boundary modeling into the Qwen2.5 large language model (LLM). To enhance speaker discrimination, speech-separation awareness, and cross-lingual generalization, we introduce language-specific adapters and apply LoRA-based fine-tuning. Evaluated on the MLC-SLM Challenge dataset, our approach achieves tcpWERs of 23.56% on the development set and 18.08% on the test set, substantially outperforming the official baseline. To our knowledge, this is the first empirical validation of an LLM-based architecture for MS-ASR without explicit speaker diarization or alignment supervision, demonstrating both effectiveness and scalability in jointly modeling speaker identity, segmentation, and multilingual recognition.

📝 Abstract
We present the DKU system for Task 2 of the MLC-SLM Challenge, which aims to perform multi-speaker automatic speech recognition directly from raw audio without oracle speaker labels or time boundaries. Our approach builds on a diarization-aware framework that integrates speaker embeddings and temporal utterance boundaries into a Qwen2.5-based large language model (LLM). We then enhance the system's multilingual performance by fine-tuning language-specific adapters and LoRA modules within the LLM decoder. Our system achieves tcpWERs of 23.56% and 18.08% on the development and test sets of the MLC-SLM dataset, substantially outperforming the official baseline.
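The tcpWER metric used in the challenge is a time-constrained, speaker-attributed extension of the standard word error rate. The time and speaker-permutation handling is beyond a short sketch, but the underlying WER computation is a word-level Levenshtein edit distance divided by the reference length; a minimal illustration (the example sentences are our own):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions only
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("sit" for "sat") and one deletion ("the") over 6 words.
print(wer("the cat sat on the mat", "the cat sit on mat"))  # → 2/6 ≈ 0.333
```

tcpWER additionally matches hypothesis and reference utterances per speaker under a time-collar constraint before accumulating these edit counts, which is why it penalizes segmentation and speaker-attribution errors as well as transcription errors.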
Problem

Research questions and friction points this paper is trying to address.

Multi-speaker speech recognition without speaker labels
Diarization-aware framework with speaker embeddings
Enhancing multilingual performance via fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diarization-aware framework with speaker embeddings
Qwen2.5-based LLM with temporal boundaries
Fine-tuned multilingual adapters and LoRA modules
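The paper fine-tunes LoRA modules inside the Qwen2.5 decoder. The LoRA mechanism itself can be sketched independently of the model: a frozen weight matrix is augmented with a trainable low-rank update scaled by alpha/r, with the update initialized to zero so training starts from the pretrained behavior. A minimal NumPy illustration (dimensions, rank, and scaling here are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 8, 2, 16            # hidden size, LoRA rank, scaling factor
W = rng.standard_normal((d, d))   # frozen pretrained weight (not trained)

# Only A and B are trainable; B starts at zero so the initial update is zero.
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = W x + (alpha / r) * B (A x): frozen weight plus low-rank delta."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
# At initialization the LoRA branch contributes nothing,
# so the adapted layer reproduces the frozen layer exactly.
assert np.allclose(lora_forward(x), W @ x)
```

Because only A and B (2·r·d parameters per layer) receive gradients, per-language adapters of this form can be trained and swapped cheaply on top of a single shared LLM decoder.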
Yuke Lin
Huawei Technologies Co. Ltd
Computer Science
Ming Cheng
Dartmouth College
Ze Li
School of Computer Science, Wuhan University, China; Suzhou Municipal Key Laboratory of Multimodal Intelligent Systems, Digital Innovation Research Center, Duke Kunshan University, China
Ming Li
School of Computer Science, Wuhan University, China; Suzhou Municipal Key Laboratory of Multimodal Intelligent Systems, Digital Innovation Research Center, Duke Kunshan University, China