🤖 AI Summary
This study addresses the challenge of estimating the prevalence of false or misleading information on social media, where estimates are affected by several sources of uncertainty: sampling variability, inter-annotator disagreement in manual labeling, and the choice of keywords used for retrieval. To make mitigation strategies more robust, the work proposes a unified framework that jointly quantifies these three sources by combining multinomial simulation with bootstrap resampling. The approach yields uncertainty-aware prevalence estimates and is evaluated on a multi-platform, multilingual dataset annotated by professional fact-checkers. Empirical results show that the uncertainty introduced by keyword-based retrieval can exceed baseline variability, leading to wider confidence intervals. These findings underscore the value of jointly modeling all major uncertainty sources when estimating misinformation prevalence.
📝 Abstract
Estimating the prevalence of mis/disinformation on social media is crucial for designing mitigation strategies that limit its impact. Yet such estimates are subject to several uncertainties that are rarely quantified jointly. In this study, we present a methodological contribution that uses confidence intervals to quantify the uncertainties surrounding mis/disinformation prevalence. The analysis draws on a multi-platform, multilingual dataset annotated by professional fact-checkers. Data were collected between March and April 2025 from Facebook, Instagram, LinkedIn, TikTok, X/Twitter, and YouTube across four EU Member States (France, Poland, Slovakia, and Spain). We account for three causes of uncertainty: (i) sample uncertainty, (ii) annotation uncertainty arising from human disagreement and misclassification, and (iii) data retrieval uncertainty induced by keyword-based data collection. First, we estimate the uncertainty arising from each cause separately using confidence intervals, simulation-based methods, and bootstrapping. Then, we combine multinomial simulations of annotator behaviour with resampling of keywords and posts to capture the joint impact of measurement uncertainty on mis/disinformation prevalence estimates. The proposed approach highlights the importance of uncertainty-aware estimation of mis/disinformation prevalence for robust analysis. The empirical results show that the uncertainty introduced by keyword-based data retrieval can exceed baseline variability, leading to wider confidence intervals around prevalence estimates.
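The abstract does not spell out the joint procedure, so the following is only a minimal sketch of how resampling keywords and posts could be combined with a multinomial simulation of annotator labels. Everything specific here is an assumption made for illustration: the function name `joint_prevalence_interval`, the `(keyword, vote_counts)` input layout, the use of raw vote shares as label probabilities, the percentile interval, and the toy keywords and counts.

```python
import random
from collections import defaultdict


def joint_prevalence_interval(posts, n_sim=2000, alpha=0.05, seed=0):
    """Percentile interval for prevalence under three uncertainty sources.

    posts: list of (keyword, vote_counts) pairs. vote_counts[0] is the
    number of annotators who chose the mis/disinformation label; the other
    entries count the remaining labels (hypothetical layout).
    """
    rng = random.Random(seed)

    # Group the per-post vote counts by the keyword that retrieved the post.
    by_keyword = defaultdict(list)
    for keyword, votes in posts:
        by_keyword[keyword].append(list(votes))
    keywords = list(by_keyword)
    classes = range(len(posts[0][1]))

    estimates = []
    for _ in range(n_sim):
        flagged = total = 0
        # (iii) retrieval uncertainty: resample keywords with replacement.
        for kw in rng.choices(keywords, k=len(keywords)):
            pool = by_keyword[kw]
            # (i) sample uncertainty: resample posts within the keyword.
            for votes in rng.choices(pool, k=len(pool)):
                # (ii) annotation uncertainty: a single-trial multinomial
                # draw of the label, weighted by the observed vote shares
                # (an assumption, not the paper's stated model).
                label = rng.choices(classes, weights=votes)[0]
                flagged += (label == 0)
                total += 1
        estimates.append(flagged / total)

    # Percentile interval over the simulated prevalence estimates.
    estimates.sort()
    lower = estimates[int((alpha / 2) * (n_sim - 1))]
    upper = estimates[int((1 - alpha / 2) * (n_sim - 1))]
    point = sum(estimates) / n_sim
    return point, lower, upper


# Toy data: keywords and vote counts are invented for illustration only.
toy_posts = [
    ("keyword_a", [3, 0]), ("keyword_a", [1, 2]), ("keyword_a", [0, 3]),
    ("keyword_b", [0, 3]), ("keyword_b", [2, 1]),
    ("keyword_c", [0, 3]), ("keyword_c", [0, 3]), ("keyword_c", [1, 2]),
]
point, lower, upper = joint_prevalence_interval(toy_posts)
print(f"prevalence ~ {point:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

Skipping the keyword-resampling step (iterating over `keywords` directly) gives an interval that reflects only sample and annotation uncertainty, which is one way to see how much width the retrieval step adds in this sketch.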