🤖 AI Summary
The capabilities of large language models (LLMs) in automated fact-checking remain poorly understood. Method: This study conducts the first systematic evaluation of mainstream open-source LLMs (e.g., Llama, Mistral) across three core fact-checking tasks: identifying the semantic relation between a claim and a verification document, verifying claims grounded in pre-verified texts, and end-to-end fact-checking of raw news claims augmented with external knowledge from Google or Wikipedia, comparing performance against a fine-tuned RoBERTa baseline in each case. Contribution/Results: LLMs match or exceed RoBERTa on relation identification and on verifying already fact-checked stories, yet significantly underperform on end-to-end fact-checking of unseen claims. Crucially, integrating external knowledge via retrieval-augmented generation yields no substantial improvement. These findings empirically delineate the current capability boundaries of LLMs in fact-checking, clarifying their viable application domains and critical limitations, and thereby offer actionable guidance for future model design and deployment strategies.
📝 Abstract
The increasing prevalence of online misinformation has heightened the demand for automated fact-checking solutions. Large Language Models (LLMs) have emerged as potential tools for assisting in this task, but their effectiveness remains uncertain. This study evaluates the fact-checking capabilities of various open-source LLMs, focusing on their ability to assess claims with different levels of contextual information. We conduct three key experiments: (1) evaluating whether LLMs can identify the semantic relationship between a claim and a fact-checking article, (2) assessing models' accuracy in verifying claims when given a related fact-checking article, and (3) testing LLMs' fact-checking abilities when leveraging data from external knowledge sources such as Google and Wikipedia. Our results indicate that LLMs perform well in identifying claim-article connections and verifying fact-checked stories but struggle with confirming factual news, where they are outperformed by traditional fine-tuned models such as RoBERTa. Additionally, the introduction of external knowledge does not significantly enhance LLMs' performance, calling for more tailored approaches. Our findings highlight both the potential and limitations of LLMs in automated fact-checking, emphasizing the need for further refinements before they can reliably replace human fact-checkers.