CS-YODAS: A Mined Dataset of In-the-Wild Code-Switched Speech

📅 2026-06-09

🤖 AI Summary

Existing code-switching speech datasets are often small, domain-specific, or artificially constructed, so they fail to capture the alternation patterns found in real multilingual settings. This work addresses the gap with CS-YODAS, a dataset of naturally occurring code-switched speech mined from the YODAS corpus of open-domain YouTube data, spanning seven matrix languages and totaling 313 hours of audio. The authors introduce a scalable human-in-the-loop pipeline for identifying and validating authentic code-switching segments. The dataset is accompanied by an analysis of language-pair frequencies and switching patterns, along with baseline results for spoken language identification, and is intended to encourage broader research on code-switched speech.

📝 Abstract

We present CS-YODAS, a Creative Commons-licensed dataset of in-the-wild code-switched speech mined from multilingual YouTube data. Code-switching (CS), or the alternation between languages within an utterance or conversation, is common in multilingual settings but remains underrepresented in existing CS speech resources, which are typically small, domain-specific, or artificially constructed. Building on the YODAS corpus, we develop a scalable, human-in-the-loop pipeline for identifying and validating naturally occurring code-switching. The resulting dataset, which totals 313 hours and spans 7 matrix languages, provides diverse, real-world examples of spontaneous code-switched speech. We further analyze the distribution and characteristics of code-switching in the wild, examining language-pair frequencies and switching patterns, and report baseline results for spoken language identification. We hope that CS-YODAS will encourage broader and more comprehensive research on code-switched speech. Dataset link: https://huggingface.co/datasets/byan/cs-yodas.
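The language-pair and switching-pattern analysis mentioned in the abstract can be illustrated with a small sketch. It is not the authors' method: it assumes each utterance is already available as a list of per-token language tags (a hypothetical format, not the dataset's documented schema), and it approximates the matrix language as the most frequent tag, which is only a heuristic. The function names and the toy data are invented for illustration.

```python
from collections import Counter


def switch_points(tags):
    """Count positions where the language tag differs from the previous token."""
    return sum(1 for prev, cur in zip(tags, tags[1:]) if prev != cur)


def language_pair(tags):
    """Return (matrix, embedded) for an utterance, or None if it is monolingual.

    Assumption: the most frequent tag stands in for the matrix language and
    the second most frequent for the embedded language.
    """
    counts = Counter(tags)
    if len(counts) < 2:
        return None
    (matrix, _), (embedded, _) = counts.most_common(2)
    return (matrix, embedded)


# Toy utterances with hypothetical per-token language tags.
utterances = [
    ["en", "en", "es", "es", "en"],
    ["hi", "hi", "hi", "en", "hi"],
    ["en", "en", "en", "es"],
    ["fr", "fr"],
]

# Frequency of each (matrix, embedded) pair across code-switched utterances.
pair_counts = Counter(
    pair for pair in map(language_pair, utterances) if pair is not None
)

# Number of switch points per utterance.
switches = [switch_points(tags) for tags in utterances]

if __name__ == "__main__":
    for pair, n in pair_counts.most_common():
        print(pair, n)
    print(switches)
```

On real data the tags would have to come from transcripts or a language-identification model, and a frequency-based matrix language would need checking against a linguistically motivated definition.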

Problem

Research questions and friction points this paper is trying to address.

code-switching

speech dataset

in-the-wild

multilingual

natural language

Innovation

Methods, ideas, or system contributions that make the work stand out.

code-switching

in-the-wild speech

scalable mining pipeline