🤖 AI Summary
In multi-speaker automatic speech recognition (ASR), Serialized Output Training (SOT) suffers from recognition errors caused by speaker assignment failures, and existing remedies depend on hard-to-obtain auxiliary information such as token-level timestamps. To address these limitations, this paper proposes Speaker-Distinguishable CTC (SD-CTC), a novel end-to-end extension of CTC that jointly assigns a token and its corresponding speaker label to each frame, embedding speaker discrimination directly into the CTC alignment mechanism. SD-CTC is integrated into the SOT framework via multi-task learning, so the model learns speaker distinction from overlapping speech and transcriptions alone, with no timestamp annotations required. Experiments show that SD-CTC reduces the word error rate of the SOT baseline by 26% and achieves performance on par with state-of-the-art methods that depend on token-level timestamps or other auxiliary information.
📝 Abstract
This paper presents a novel framework for multi-talker automatic speech recognition without the need for auxiliary information. Serialized Output Training (SOT), a widely used approach, suffers from recognition errors due to speaker assignment failures. Although incorporating auxiliary information, such as token-level timestamps, can improve recognition accuracy, extracting such information from natural conversational speech remains challenging. To address this limitation, we propose Speaker-Distinguishable CTC (SD-CTC), an extension of CTC that jointly assigns a token and its corresponding speaker label to each frame. We further integrate SD-CTC into the SOT framework, enabling the SOT model to learn speaker distinction using only overlapping speech and transcriptions. Experimental comparisons show that multi-task learning with SD-CTC and SOT reduces the error rate of the SOT model by 26% and achieves performance comparable to state-of-the-art methods relying on auxiliary information.
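The abstract's core idea, jointly assigning a token and a speaker label to each frame within CTC, can be illustrated with a small sketch. One plausible way to realize this (an assumption on our part, not the paper's stated construction) is to run CTC over a product label space of (token, speaker) pairs and combine its loss with the SOT loss via an interpolation weight. The names `make_joint_vocab`, `sd_ctc_targets`, and the weight `lam` are illustrative, not taken from the paper.

```python
# Hedged sketch of SD-CTC's joint token-speaker labeling, assuming a
# product label space (token, speaker) -> joint id. All names here are
# illustrative; the paper's actual formulation may differ.

def make_joint_vocab(tokens, num_speakers):
    """Map every (token, speaker) pair to a single joint label id.
    Id 0 is reserved for the CTC blank, as in standard CTC."""
    joint = {}
    next_id = 1  # 0 = blank
    for spk in range(num_speakers):
        for tok in tokens:
            joint[(tok, spk)] = next_id
            next_id += 1
    return joint

def sd_ctc_targets(transcripts, joint):
    """Flatten per-speaker transcripts into one joint-label target sequence.
    `transcripts[s]` is the token list spoken by speaker s."""
    targets = []
    for spk, toks in enumerate(transcripts):
        targets.extend(joint[(tok, spk)] for tok in toks)
    return targets

def multitask_loss(loss_sot, loss_sd_ctc, lam=0.3):
    """Interpolated multi-task objective combining SOT and SD-CTC losses.
    The weight `lam` is a placeholder hyperparameter, not from the paper."""
    return (1.0 - lam) * loss_sot + lam * loss_sd_ctc

if __name__ == "__main__":
    joint = make_joint_vocab(["a", "b", "c"], num_speakers=2)
    print(len(joint))                                  # 3 tokens x 2 speakers = 6
    print(sd_ctc_targets([["a", "b"], ["c"]], joint))  # speaker-tagged targets
    print(multitask_loss(2.0, 1.0, lam=0.5))
```

Under this reading, the joint labels let a single CTC alignment decide "which token" and "whose token" per frame, which is what removes the need for token-level timestamps as supervision.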