🤖 AI Summary
This work exposes latent cultural biases in hate speech (HS) datasets: conventional language-based curation overlooks cultural heterogeneity among countries that share a language, which can leave datasets geographically imbalanced. To address this, the study treats language and geography as two interrelated cultural proxies and systematically surveys 27 mainstream HS datasets across eight languages. The survey confirms the previously reported bias toward English-language datasets, but shows that this bias has been steadily decreasing in recent years. For three geographically widespread languages (English, Arabic, and Spanish), the study then pairs language with country information drawn from tweet geographic metadata to approximate geo-cultural contexts. HS datasets in these languages show a strong geo-cultural bias: they overrepresent a handful of countries (e.g., the US and UK for English) relative to those countries' prominence in both the broader social media population and the general population speaking each language. Based on these findings, the authors formulate recommendations for building future HS datasets, moving HS research from a “language-centric” toward a “culturally aware” paradigm.
📝 Abstract
Perceptions of hate can vary greatly across cultural contexts. Hate speech (HS) datasets, however, have traditionally been developed by language. This hides potential cultural biases, as one language may be spoken in different countries that are home to different cultures. In this work, we evaluate cultural bias in HS datasets by leveraging two interrelated cultural proxies: language and geography. We conduct a systematic survey of HS datasets in eight languages and confirm past findings on their English-language bias, but also show that this bias has been steadily decreasing in the past few years. For three geographically widespread languages (English, Arabic, and Spanish), we then leverage geographical metadata from tweets to approximate geo-cultural contexts by pairing language and country information. We find that HS datasets for these languages exhibit a strong geo-cultural bias, largely overrepresenting a handful of countries (e.g., US and UK for English) relative to their prominence in both the broader social media population and the general population speaking these languages. Based on these findings, we formulate recommendations for the creation of future HS datasets.
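One way to make the notion of overrepresentation concrete is to compare each country's share of a dataset with its share of a reference population (for example, speakers of the language or social media users). The sketch below is illustrative only: the function name `representation_ratios`, the ratio-based measure, and all numbers are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def representation_ratios(dataset_countries, reference_shares):
    """Return (dataset share / reference share) for each reference country.

    dataset_countries: one country code per geolocated post in a dataset.
    reference_shares: country code -> share of the reference population
                      (speakers or social media users), as a fraction.
    A ratio above 1 suggests overrepresentation, below 1 underrepresentation.
    """
    counts = Counter(dataset_countries)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("dataset_countries is empty")

    ratios = {}
    for country, ref_share in reference_shares.items():
        if ref_share <= 0:
            continue  # skip countries with no reference population
        dataset_share = counts.get(country, 0) / total
        ratios[country] = dataset_share / ref_share
    return ratios


# Toy, made-up values for a single language (NOT results from the paper).
toy_posts = ["US"] * 6 + ["GB"] * 3 + ["IN"] * 1
toy_reference = {"US": 0.4, "GB": 0.1, "IN": 0.5}

ratios = representation_ratios(toy_posts, toy_reference)
for country, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{country}: {ratio:.2f}")
```

In this toy setup, countries with a ratio well above 1 take up more of the dataset than their reference share would suggest. Running the same comparison against both a speaker-population reference and a social-media-population reference mirrors the two baselines mentioned in the abstract.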