🤖 AI Summary
Existing rumour stance classification datasets suffer from narrow language coverage, links to few fact-checked claims, and single-annotator labelling, which hinders modelling of annotation variability. To address this, SCRum-9 is introduced: the first multilingual social-media rumour stance dataset covering nine languages, comprising 7,516 tweet–reply pairs collected from X (formerly Twitter) and linked to 2,100 fact-checked claims. Each example is labelled by multiple native speakers, enabling fine-grained, cross-lingual stance annotation and systematic modelling of intra- and inter-annotator variability. Experiments show that both state-of-the-art large language models (e.g., DeepSeek) and fine-tuned pre-trained models struggle on SCRum-9, confirming its value as a challenging benchmark for advancing robust, multilingual, and human-aligned rumour stance analysis.
📝 Abstract
We introduce SCRum-9, a multilingual dataset for Rumour Stance Classification, containing 7,516 tweet-reply pairs from X. SCRum-9 goes beyond existing stance classification datasets by covering more languages (9), linking examples to more fact-checked claims (2.1k), and including complex annotations from multiple annotators to account for intra- and inter-annotator variability. Annotations were made by at least three native speakers per language, totalling around 405 hours of annotation and $8,150 in compensation. Experiments on SCRum-9 show that it is a challenging benchmark for both state-of-the-art LLMs (e.g., DeepSeek) and fine-tuned pre-trained models, motivating future work in this area.
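To make the multi-annotator setup concrete, here is a minimal sketch of what a record with per-annotator stance labels might look like, together with a simple disagreement measure. The field names, label set, and example text are illustrative assumptions, not SCRum-9's actual schema.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical record layout; the real SCRum-9 schema may differ.
@dataclass
class StanceExample:
    tweet: str    # source rumour tweet
    reply: str    # reply whose stance is labelled
    claim: str    # linked fact-checked claim
    labels: list = field(default_factory=list)  # one label per annotator

def majority_label(labels):
    """Majority vote over annotator labels (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

def disagreement(labels):
    """Fraction of annotators who differ from the majority label."""
    counts = Counter(labels)
    return 1 - counts.most_common(1)[0][1] / len(labels)

ex = StanceExample(
    tweet="Breaking: city water supply contaminated!",
    reply="Officials already debunked this, see the report.",
    claim="The city water supply is contaminated.",
    labels=["deny", "deny", "query"],  # three native-speaker annotators
)
print(majority_label(ex.labels))           # → deny
print(round(disagreement(ex.labels), 2))   # → 0.33
```

Keeping every annotator's label (rather than collapsing to one gold label) is what allows variability to be modelled at all; a per-example disagreement score like the one above is one simple way to surface hard or ambiguous instances.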