UniCoM: A Universal Code-Switching Speech Generator

📅 2025-08-21
🤖 AI Summary
Code-switching (CS) is prevalent in real-world multilingual conversations, yet speech technologies that handle it are held back by the scarcity of suitable CS data. To address this, the authors propose Universal Code-Mixer (UniCoM), a pipeline for generating natural CS samples without altering sentence semantics. At its core is SWORDS (Substituting WORDs with Synonyms), an algorithm that replaces selected words with their translations while taking their parts of speech into account. Using UniCoM, the authors construct CS-FLEURS, a multilingual CS corpus for automatic speech recognition (ASR) and speech-to-text translation (S2TT). Experiments show that CS-FLEURS achieves high intelligibility and naturalness, performing comparably to existing datasets on both objective and subjective metrics.

📝 Abstract
Code-switching (CS), the alternation between two or more languages within a single speaker's utterances, is common in real-world conversations and poses significant challenges for multilingual speech technology. However, systems capable of handling this phenomenon remain underexplored, primarily due to the scarcity of suitable datasets. To resolve this issue, we propose Universal Code-Mixer (UniCoM), a novel pipeline for generating high-quality, natural CS samples without altering sentence semantics. Our approach utilizes an algorithm we call Substituting WORDs with Synonyms (SWORDS), which generates CS speech by replacing selected words with their translations while considering their parts of speech. Using UniCoM, we construct Code-Switching FLEURS (CS-FLEURS), a multilingual CS corpus designed for automatic speech recognition (ASR) and speech-to-text translation (S2TT). Experimental results show that CS-FLEURS achieves high intelligibility and naturalness, performing comparably to existing datasets on both objective and subjective metrics. We expect our approach to advance CS speech technology and enable more inclusive multilingual systems.
Problem

Research questions and friction points this paper is trying to address.

Generating natural code-switching speech without altering semantics
Addressing scarcity of datasets for multilingual code-switching technology
Creating high-quality code-switching corpus for ASR and translation
Innovation

Methods, ideas, or system contributions that make the work stand out.

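SWORDS (Substituting WORDs with Synonyms), an algorithm that replaces selected words with their translations while respecting their parts of speech
UniCoM, a pipeline that generates code-switching speech without altering sentence semantics
CS-FLEURS, a multilingual code-switching corpus for ASR and speech-to-text translation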
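
The part-of-speech-aware substitution idea behind SWORDS can be illustrated with a minimal text-level sketch. This is not the authors' implementation: the toy lexicon, the set of switchable tags, the `ratio` parameter, and the function name `substitute_words` are all hypothetical, and the input is assumed to be already tagged with Universal POS labels. The speech-generation stage of the pipeline is not covered here.

```python
import random
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag); tags assumed to be Universal POS labels

# Hypothetical toy English-to-Spanish lexicon keyed by (lowercased word, POS).
# Keying on POS lets "book" map to different translations as noun vs. verb.
TOY_LEXICON: Dict[Tuple[str, str], str] = {
    ("book", "NOUN"): "libro",
    ("book", "VERB"): "reservar",
    ("red", "ADJ"): "rojo",
    ("reads", "VERB"): "lee",
}

# Assumption for this sketch: only content words may be switched.
SWITCHABLE_POS = {"NOUN", "VERB", "ADJ"}


def substitute_words(
    tokens: List[Token],
    lexicon: Dict[Tuple[str, str], str],
    ratio: float = 0.5,
    seed: int = 0,
) -> List[str]:
    """Replace a fraction of eligible words with POS-matched translations."""
    rng = random.Random(seed)
    # A word is eligible if its POS is switchable and a POS-matched entry exists.
    candidates = [
        i
        for i, (word, pos) in enumerate(tokens)
        if pos in SWITCHABLE_POS and (word.lower(), pos) in lexicon
    ]
    k = int(ratio * len(candidates))  # number of words to switch (rounded down)
    chosen = set(rng.sample(candidates, k))
    return [
        lexicon[(word.lower(), pos)] if i in chosen else word
        for i, (word, pos) in enumerate(tokens)
    ]


sentence = [("She", "PRON"), ("reads", "VERB"), ("a", "DET"),
            ("red", "ADJ"), ("book", "NOUN")]
print(" ".join(substitute_words(sentence, TOY_LEXICON, ratio=1.0)))
```

With `ratio=1.0` every eligible word is switched; lower values leave more of the original sentence intact. A real system would need a tagger, a proper translation resource, and a check that the result still reads naturally, none of which this sketch attempts.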
Sangmin Lee
Dept. of Electrical & Electronic Engineering, Yonsei University, Seoul, South Korea
Woojin Chung
Dept. of Electrical & Electronic Engineering, Yonsei University, Seoul, South Korea
Seyun Um
Dept. of Electrical & Electronic Engineering, Yonsei University, Seoul, South Korea
Hong-Goo Kang
Yonsei University