SemCAFE: When Named Entities make the Difference–Assessing Web Source Reliability through Entity-level Analytics

📅 2025-04-03

🏛️ Web Science Conference

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Credible and deceptive news are increasingly hard to tell apart in digital media because authentic and fabricated articles cover the same topics in similar language. This paper proposes a news source credibility assessment method based on the semantic relatedness of named entities. Rather than relying on surface-level textual features, the method integrates entity recognition, disambiguation, and semantic relation modeling driven by the YAGO knowledge base into credibility classification, building an article-level “semantic fingerprint” for fine-grained, interpretable reliability evaluation. Web page boilerplate removal is combined with standard NLP preprocessing before feature extraction. Evaluated on a dataset of news about the 2022 Russian invasion of Ukraine (46,020 reliable vs. 3,407 unreliable articles), the approach improves macro-F1 by 12% over state-of-the-art methods.

📝 Abstract

With the shift from traditional to digital media, the online landscape now hosts not only reliable news articles but also a significant amount of unreliable content. Digital media reaches audiences faster, significantly influencing public opinion and advancing political agendas. While newspaper readers may be familiar with their preferred outlets’ political leanings or credibility, identifying unreliable news articles online is much more challenging. The credibility of many online sources is often opaque, and AI-generated content is easily disseminated at minimal cost. Unreliable news articles, particularly those that followed the Russian invasion of Ukraine in 2022, closely mimic the topics and writing styles of credible sources, making them difficult to distinguish. To address this, we introduce SemCAFE (Semantically enriched Content Assessment for Fake news Exposure), a system designed to detect news reliability by incorporating entity-relatedness into its assessment. SemCAFE employs standard Natural Language Processing (NLP) techniques, such as boilerplate removal and tokenization, alongside entity-level semantic analysis using the YAGO knowledge base. By creating a “semantic fingerprint” for each news article, SemCAFE assessed the credibility of 46,020 reliable and 3,407 unreliable articles on the 2022 Russian invasion of Ukraine. Our approach improved the macro F1 score by 12% over state-of-the-art methods. The sample data and code are available on GitHub.
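The dataset is heavily imbalanced (46,020 reliable vs. 3,407 unreliable articles), which is presumably why results are reported as macro F1 rather than accuracy. The sketch below is illustrative only and is not taken from the SemCAFE code: it uses a hypothetical baseline that labels every article as reliable, and shows that such a baseline scores high accuracy but a low macro F1, because macro F1 averages the per-class F1 scores with equal weight.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for one class from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: list, y_pred: list, labels: list) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        scores.append(f1_score(tp, fp, fn))
    return sum(scores) / len(scores)


# Class sizes from the abstract; the predictions are a hypothetical
# "always reliable" baseline, not a result reported in the paper.
y_true = ["reliable"] * 46020 + ["unreliable"] * 3407
y_pred = ["reliable"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred, ["reliable", "unreliable"])

print(f"accuracy = {accuracy:.3f}")
print(f"macro F1 = {macro:.3f}")
```

The trivial baseline is right on roughly 93% of articles yet has a macro F1 below 0.5, since its F1 on the unreliable class is zero. A gain in macro F1 therefore requires better detection of the minority class.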

Problem

Research questions and friction points this paper is trying to address.

Detecting unreliable news articles online

Assessing web source reliability using entity analytics

Distinguishing credible from AI-generated content

Innovation

Methods, ideas, or system contributions that make the work stand out.

Entity-level semantic analysis using YAGO knowledge base

Semantic fingerprint creation for news articles

Boilerplate removal and tokenization preprocessing techniques
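To make the idea of an entity-based “semantic fingerprint” concrete, here is a minimal sketch under strong assumptions. It is not the SemCAFE implementation: `TOY_KB` and `ALIASES` are hypothetical stand-ins for the YAGO knowledge base and for a real entity recognition and disambiguation step, and the Jaccard-style `relatedness` score is one simple choice among many, not the measure used in the paper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy knowledge base: entity -> directly related entities.
# A real system would query YAGO instead.
TOY_KB = {
    "Kyiv": {"Ukraine", "Dnipro_River"},
    "Ukraine": {"Kyiv", "Europe", "Russia"},
    "Russia": {"Moscow", "Ukraine", "Europe"},
    "Moscow": {"Russia"},
}

# Hypothetical alias table standing in for NER plus disambiguation:
# different surface forms map to one canonical entity.
ALIASES = {
    "kyiv": "Kyiv",
    "kiev": "Kyiv",
    "ukraine": "Ukraine",
    "russia": "Russia",
    "moscow": "Moscow",
}


def link_entities(text: str) -> list:
    """Map whitespace-separated tokens to canonical entity ids."""
    tokens = (tok.strip(".,;:!?\"'()").lower() for tok in text.split())
    return [ALIASES[tok] for tok in tokens if tok in ALIASES]


def relatedness(a: str, b: str) -> float:
    """Jaccard overlap of the two entities' neighbourhoods (including themselves)."""
    na = TOY_KB.get(a, set()) | {a}
    nb = TOY_KB.get(b, set()) | {b}
    return len(na & nb) / len(na | nb)


def semantic_fingerprint(text: str) -> dict:
    """Entity counts plus the mean pairwise relatedness of distinct entities."""
    counts = Counter(link_entities(text))
    pairs = list(combinations(sorted(counts), 2))
    mean_rel = sum(relatedness(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return {"entity_counts": dict(counts), "mean_relatedness": mean_rel}


fingerprint = semantic_fingerprint("Kiev and Kyiv are spellings; Moscow is in Russia.")
print(fingerprint)
```

In this sketch both spellings of the Ukrainian capital collapse to one entity, which is the point of disambiguation: the fingerprint describes what an article is about, not how it is worded. Such a fingerprint could then be fed to a classifier as features.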

🔎 Similar Papers

Bridging Social Media and Search Engines: Dredge Words and the Detection of Unreliable Domains