🤖 AI Summary
Manual expert annotation of news source reliability is infeasible at scale, necessitating automated approaches. Method: This study uses large language models (LLMs) to assess the credibility of Italian online news publishers across six expert-defined reliability dimensions, including negative framing, sensationalist language, and bias detection, applying zero-shot and few-shot prompting to 340 news articles. Human–LLM agreement was evaluated with Cohen's Kappa, and the LLM's ability to mediate expert disagreement was examined. Contribution/Results: The study offers a systematic validation of LLMs for multidimensional news credibility assessment. Results show strong human–LLM agreement on three criteria (κ > 0.6), moderate agreement on two, and demonstrate that the LLM can effectively mediate inter-expert disagreement, particularly in negative-framing judgments. Across 340 articles, six criteria, and three annotators (two human experts plus the LLM), the study comprises 6,120 annotations, supporting the LLM's viability as a scalable, reliable proxy for expert evaluation in news credibility analysis.
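To make the agreement metric concrete, here is a minimal sketch of how a per-criterion human–LLM Cohen's Kappa could be computed with off-the-shelf tooling. The toy labels and the binary label scheme are assumptions for illustration; the paper's actual annotation format and implementation are not specified in this summary.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels for one criterion: 1 = criterion present, 0 = absent.
# In the study, each of the 340 articles receives one label per annotator for
# each of the six reliability criteria (340 x 6 x 3 annotators = 6,120 labels).
expert_labels = [1, 0, 0, 1, 1, 0, 1, 0]  # one human expert's judgments (toy data)
llm_labels    = [1, 0, 1, 1, 1, 0, 1, 0]  # the LLM's judgments (toy data)

# Cohen's kappa corrects raw agreement for agreement expected by chance;
# values above 0.6 are conventionally read as substantial agreement.
kappa = cohen_kappa_score(expert_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```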
📝 Abstract
In this study, we investigate the use of a large language model (LLM) to assist in evaluating the reliability of online news publishers, a task too vast to rely solely on human expert annotators. Focusing on the Italian news media market, we first task the model with evaluating expert-designed reliability criteria on a representative sample of news articles, and we then compare the model's answers with those of human experts. The dataset consists of 340 news articles, each annotated by two human experts and the LLM against six criteria, for a total of 6,120 annotations. We observe good agreement between the LLM and the human annotators on three of the six evaluated criteria, including the critical ability to detect instances where a text negatively targets an entity or individual. For two further criteria, namely the detection of sensational language and the recognition of bias in news content, the LLM produces fair annotations, albeit with certain trade-offs. Furthermore, we show that the LLM is able to help resolve disagreements among human experts, especially in tasks such as identifying cases of negative targeting.
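As an illustration of the annotation protocol, the following sketch shows what a zero-shot call for one of the six criteria (negative targeting) might look like. Everything here is an assumption: the client library, the model name, and the prompt wording are placeholders, since the abstract does not disclose the paper's actual prompts or model.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; the paper's model is not named here

# Hypothetical zero-shot prompt for one criterion; the study's expert-designed
# criteria definitions are not reproduced in the abstract.
PROMPT = (
    "You are annotating Italian news articles for reliability criteria.\n"
    "Criterion: does the text negatively target a specific entity or individual?\n"
    "Answer with exactly one word: YES or NO.\n\n"
    "Article:\n{article}"
)

def annotate(article_text: str) -> str:
    """Return the model's YES/NO judgment for a single article and criterion."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content": PROMPT.format(article=article_text)}],
        temperature=0,   # deterministic output makes agreement analysis reproducible
    )
    return response.choices[0].message.content.strip()
```

Running such a call once per article and criterion would yield the LLM's share of the annotations, which can then be scored against the expert labels with Cohen's Kappa as above.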