Language-Agnostic Modeling of Source Reliability on Wikipedia

📅 2024-10-24
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of language-agnostic models for assessing information reliability in multilingual Wikipedia. We propose the first cross-lingual source credibility quantification framework. Methodologically, we introduce “domain permanence”—the temporal stability of edits within a topic domain—as a novel, linguistically invariant predictive feature. Combined with edit-behavior feature engineering, multilingual domain usage modeling, and transfer learning, our approach enables unified modeling and adaptation across high-, medium-, and low-resource languages. Experiments show that the model achieves macro-F1 scores of 0.80 on high-resource languages and 0.65 on medium-resource ones. Crucially, the permanence feature demonstrates strong cross-lingual and cross-topic generalizability, particularly on contentious subjects such as climate change and COVID-19. Our framework provides a transferable, interpretable tool for misinformation governance and multilingual knowledge credibility assessment.

📝 Abstract
Over the last few years, content verification through reliable sources has become a fundamental need in combating disinformation. Here, we present a language-agnostic model designed to assess the reliability of sources across multiple language editions of Wikipedia. Utilizing editorial activity data, the model evaluates source reliability within articles of varying controversiality across topics such as Climate Change, COVID-19, History, Media, and Biology. By crafting features that express domain usage across articles, the model effectively predicts source reliability, achieving an F1 Macro score of approximately 0.80 for English and other high-resource languages. For mid-resource languages, we achieve 0.65, while performance on low-resource languages varies; in all cases, the time a domain remains present in the articles (which we dub permanence) is one of the most predictive features. We highlight the challenge of maintaining consistent model performance across languages of varying resource levels and demonstrate that adapting models from higher-resource languages can improve performance. This work contributes not only to Wikipedia's efforts to ensure content verifiability but also to reliability assessment across diverse user-generated content in various language communities.
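The permanence feature described in the abstract can be sketched in code. The snippet below is an illustrative assumption, not the paper's actual implementation: it measures the total time a source domain stays cited across an article's revision history, following the abstract's description of "the time the domain remains present in the articles". The revision format and the `domain_permanence` helper are hypothetical.

```python
from datetime import datetime, timedelta

def domain_permanence(revisions, domain):
    """Total time a source domain stays present across an article's
    revision history. `revisions` is a chronologically sorted list of
    (timestamp, set_of_cited_domains) pairs. Illustrative sketch only;
    the paper's exact feature definition may differ."""
    total = timedelta()
    present_since = None
    for ts, domains in revisions:
        if domain in domains:
            if present_since is None:
                present_since = ts  # domain (re)appears at this revision
        else:
            if present_since is not None:
                total += ts - present_since  # domain removed: close interval
                present_since = None
    # Domain still present at the last revision: count up to that point
    if present_since is not None and revisions:
        total += revisions[-1][0] - present_since
    return total

# Toy edit history: domain added, removed after 10 days, then re-added
history = [
    (datetime(2023, 1, 1), {"example.org"}),
    (datetime(2023, 1, 11), set()),
    (datetime(2023, 2, 1), {"example.org"}),
    (datetime(2023, 2, 6), {"example.org"}),
]
print(domain_permanence(history, "example.org").days)  # → 15
```

Reliable sources tend to survive editorial scrutiny, so a longer permanence plausibly correlates with higher reliability, which is consistent with the abstract's finding that permanence is among the most predictive features across languages.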
Problem

Research questions and friction points this paper is trying to address.

Multilingual Model
Reliability Assessment
Fake News Countermeasure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-lingual Modeling
Reliability Assessment
Multilingual Learning