Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition

📅 2025-06-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In multi-speaker automatic speech recognition (ASR), Serialized Output Training (SOT) suffers from recognition errors due to speaker assignment failures and relies on hard-to-obtain token-level timestamps for supervision. To address these limitations, this paper proposes Speaker-Distinguishable CTC (SD-CTC), a novel end-to-end framework that jointly models frame-level speech tokens and speaker labels within the CTC architecture—embedding speaker discrimination directly into the CTC alignment mechanism without auxiliary supervision. SD-CTC extends the CTC loss and integrates multi-task learning with the SOT paradigm for joint optimization. Crucially, it operates without requiring timestamp annotations. Experiments show that SD-CTC reduces word error rate by 26% over the SOT baseline and achieves performance on par with state-of-the-art methods that depend on token-level timestamps or other auxiliary information.

📝 Abstract
This paper presents a novel framework for multi-talker automatic speech recognition without the need for auxiliary information. Serialized Output Training (SOT), a widely used approach, suffers from recognition errors due to speaker assignment failures. Although incorporating auxiliary information, such as token-level timestamps, can improve recognition accuracy, extracting such information from natural conversational speech remains challenging. To address this limitation, we propose Speaker-Distinguishable CTC (SD-CTC), an extension of CTC that jointly assigns a token and its corresponding speaker label to each frame. We further integrate SD-CTC into the SOT framework, enabling the SOT model to learn speaker distinction using only overlapping speech and transcriptions. Experimental comparisons show that multi-task learning with SD-CTC and SOT reduces the error rate of the SOT model by 26% and achieves performance comparable to state-of-the-art methods relying on auxiliary information.
Problem

Research questions and friction points this paper is trying to address.

Improving multi-talker speech recognition without auxiliary data
Reducing speaker assignment errors in Serialized Output Training
Enhancing speaker distinction using overlapping speech and transcriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

SD-CTC jointly assigns tokens and speaker labels
Integrates SD-CTC into SOT for speaker distinction
Reduces error rate by 26% without auxiliary info
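The abstract describes SD-CTC as a CTC extension that assigns each frame both a token and a speaker label, trained jointly with SOT. A minimal sketch of that idea is below, under two loud assumptions: the SD-CTC loss is approximated as standard CTC over an expanded vocabulary of (token, speaker) pairs, and the multi-task weighting `lam` is illustrative; the paper's exact formulation may differ.

```python
# Hypothetical sketch: multi-task loss combining an SD-CTC-style objective
# with an SOT objective. NOT the paper's implementation; SD-CTC is
# approximated here as CTC over (token, speaker) pair classes.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 100      # token vocabulary size (assumed)
SPEAKERS = 2     # max speakers per mixture (assumed)
BLANK = 0        # CTC blank index

# Expanded label space: blank + one class per (token, speaker) pair.
sd_classes = 1 + VOCAB * SPEAKERS

def pair_label(token: int, speaker: int) -> int:
    """Map a (token, speaker) pair onto the expanded class space."""
    return 1 + speaker * VOCAB + token

ctc = nn.CTCLoss(blank=BLANK, zero_infinity=True)

def multitask_loss(log_probs, sd_targets, input_lens, target_lens,
                   sot_logits, sot_targets, lam=0.3):
    """L = (1 - lam) * L_SOT + lam * L_SD-CTC (weighting is an assumption)."""
    sd = ctc(log_probs, sd_targets, input_lens, target_lens)
    sot = F.cross_entropy(sot_logits, sot_targets)
    return (1 - lam) * sot + lam * sd

# Toy shapes: T=50 frames, batch B=2, two-token targets per utterance.
T, B = 50, 2
log_probs = torch.randn(T, B, sd_classes).log_softmax(-1)
sd_targets = torch.tensor([[pair_label(5, 0), pair_label(7, 1)],
                           [pair_label(3, 0), pair_label(9, 1)]])
loss = multitask_loss(
    log_probs, sd_targets,
    torch.full((B,), T, dtype=torch.long),   # per-utterance input lengths
    torch.full((B,), 2, dtype=torch.long),   # per-utterance target lengths
    torch.randn(B, VOCAB),                   # stand-in SOT decoder logits
    torch.randint(0, VOCAB, (B,)))           # stand-in SOT token targets
```

The key design point this illustrates is that no timestamps are needed: CTC's alignment machinery marginalizes over all frame-level alignments, so the model receives only the overlapping speech and its transcription, matching the supervision setting the abstract claims.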
Asahi Sakuma
NHK Science and Technology Research Laboratories, Japan
Hiroaki Sato
NHK Science and Technology Research Laboratories, Japan
Ryuga Sugano
NHK Science and Technology Research Laboratories, Japan
Tadashi Kumano
NHK Science and Technology Research Laboratories, Japan
Yoshihiko Kawai
NHK Science and Technology Research Laboratories, Japan
Tetsuji Ogawa
Waseda University
Pattern Recognition · Acoustics and Speech Processing · Internet of Things · Prognostics and Health Management