YTCommentVerse: A Multi-Category Multi-Lingual YouTube Comment Corpus

📅 2025-09-13

📈 Citations: 0

✨ Influential: 0

career value

180K/year

🤖 AI Summary

Existing research is hindered by the lack of a publicly available, large-scale, multilingual, and multi-category YouTube comment dataset, impeding systematic cross-cultural analysis of sentiment, toxicity, and user engagement patterns. To address this, we introduce the largest publicly available multilingual YouTube comment dataset to date—encompassing 15 video categories, over 50 languages, 32 million comments, and 20 million distinct users, enriched with fine-grained metadata (e.g., timestamp, like count, user geo-location, and device information). Our methodology integrates automated web crawling, language identification, video-category classification, and rigorous quality filtering to enable unified, scalable, and structured collection and annotation across languages and categories. The dataset is openly released, substantially bridging a critical gap in social video platform data resources. It serves as a high-quality benchmark for cross-lingual sentiment analysis, multilingual toxicity detection, and user engagement behavior modeling.

Technology Category

Application Category

📝 Abstract

In this paper, we introduce YTCommentVerse, a large-scale multilingual and multi-category dataset of YouTube comments. It contains over 32 million comments from 178,000 videos contributed by more than 20 million unique users spanning 15 distinct YouTube content categories such as Music, News, Education and Entertainment. Each comment in the dataset includes video and comment IDs, user channel details, upvotes and category labels. With comments in over 50 languages, YTCommentVerse provides a rich resource for exploring sentiment, toxicity and engagement patterns across diverse cultural and topical contexts. This dataset helps fill a major gap in publicly available social media datasets particularly for analyzing video sharing platforms by combining multiple languages, detailed categories and other metadata.

Problem

Research questions and friction points this paper is trying to address.

Creating a large-scale multilingual YouTube comment dataset

Analyzing sentiment, toxicity, and engagement across cultures

Addressing the gap in social media datasets for video platforms

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual YouTube comment dataset

Multi-category content classification system

Comprehensive metadata including sentiment analysis

🔎 Similar Papers

Beyond Film Subtitles: Is YouTube the Best Approximation of Spoken Vocabulary?