Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs' Capacity to Detect Veracity of Political Information

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are increasingly deployed for political fact-checking, yet their reliability in high-stakes, sensitive domains, such as pandemic-related claims and U.S. political controversies, remains inadequately assessed. Method: We conduct the first large-scale, cross-model, topic-agnostic empirical audit of GPT-4, Llama 3/3.1, Claude 3.5, and Gemini on 16,513 professionally verified statements, employing an AI auditing framework that integrates LDA topic modeling, multivariate regression, and systematic prompt engineering. Contribution/Results: Models significantly outperform chance in detecting false statements but exhibit markedly lower accuracy on true and mixed-accuracy claims; GPT-4 and Gemini achieve the highest overall accuracy, yet absolute performance remains limited. Substantial inter-model variation and topic-specific biases emerge, attributable to political skew and uneven coverage in training data. The study reveals both the promise and the structural limitations of LLMs for politically sensitive fact-checking, providing rigorous empirical grounding for trustworthy AI governance and model evaluation standards.
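To make the audit design concrete, below is a minimal sketch of the prompting step in Python. It assumes the OpenAI chat completions API purely for illustration; the exact prompt wording, model identifiers, and answer-parsing rules used in the paper are not reproduced here, so the label set and prompt text below are assumptions.

```python
# Minimal sketch of the prompt-based veracity audit (illustrative only;
# not the paper's actual prompt or parsing logic).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = {"true", "false", "mixed"}  # assumed three-way label scheme

def audit_statement(statement: str, model: str = "gpt-4") -> str:
    """Ask a model to classify one professionally fact-checked statement."""
    prompt = (
        "Evaluate the veracity of the following statement. "
        "Answer with exactly one word: true, false, or mixed.\n\n"
        f"Statement: {statement}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce sampling variance across audit runs
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "unparsed"

# Example: compare a model verdict against the journalists' label.
print(audit_statement("The Eiffel Tower is located in Berlin."))  # expected: "false"
```

In the study's setup, each of the five models would be queried over all 16,513 statements and the verdicts scored against the professional fact-checkers' labels.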

📝 Abstract
The purpose of this study is to assess how large language models (LLMs) can be used for fact-checking and to contribute to the broader debate on the use of automated means for veracity identification. To achieve this purpose, we use an AI auditing methodology that systematically evaluates the performance of five LLMs (ChatGPT 4, Llama 3 (70B), Llama 3.1 (405B), Claude 3.5 Sonnet, and Google Gemini) using prompts regarding a large set of statements fact-checked by professional journalists (16,513). Specifically, we use topic modeling and regression analysis to investigate which factors (e.g., the topic of the prompt or the LLM type) affect evaluations of true, false, and mixed statements. Our findings reveal that while ChatGPT 4 and Google Gemini achieved higher accuracy than other models, overall performance across models remains modest. Notably, the results indicate that models are better at identifying false statements, especially on sensitive topics such as COVID-19, American political controversies, and social issues, suggesting possible guardrails that may enhance accuracy on these topics. The major implication of our findings is that there are significant challenges in using LLMs for fact-checking, including substantial variation in performance across different LLMs and unequal quality of outputs for specific topics, which can be attributed to deficits in training data. Our research highlights the potential and limitations of LLMs in political fact-checking, suggesting avenues for further improvements in guardrails as well as fine-tuning.
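The analysis stage the abstract describes (topic modeling plus regression) could look roughly like the following sketch. The file name, column names (`statement`, `model`, `correct`), and the 20-topic LDA are assumptions for illustration; the paper's actual preprocessing and model specification may differ.

```python
# Sketch of the analysis stage: derive topics with LDA, then regress
# per-statement correctness on model and topic (assumed data layout).
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# One row per (statement, model) audit result; 'correct' is 0/1.
df = pd.read_csv("audit_results.csv")  # hypothetical file

# 1) Topic modeling over the fact-checked statements.
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
dtm = vectorizer.fit_transform(df["statement"])
lda = LatentDirichletAllocation(n_components=20, random_state=0)
df["topic"] = lda.fit_transform(dtm).argmax(axis=1)  # dominant topic per statement

# 2) Logistic regression: does accuracy vary by LLM and by topic?
fit = smf.logit("correct ~ C(model) + C(topic)", data=df).fit()
print(fit.summary())
```

Dummy-coded model and topic terms make it straightforward to read off which LLM and which topics are associated with higher odds of a correct verdict, mirroring the regression analysis described in the abstract.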
Problem

Research questions and friction points this paper is trying to address.

Assess LLMs' ability to detect the veracity of political information.
Evaluate the performance of five LLMs in fact-checking tasks.
Identify factors affecting LLMs' accuracy in statement evaluation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI auditing methodology evaluates LLMs' fact-checking accuracy.
Topic modeling and regression analysis identify performance factors.
LLMs detect false statements more reliably, especially on sensitive topics.
Elizaveta Kuznetsova
Weizenbaum Institute for the Networked Society, Berlin, Germany
Ilaria Vitulano
Weizenbaum Institute for the Networked Society, Berlin, Germany
M. Makhortykh
Institute of Communication and Media Studies, University of Bern, Bern, Switzerland
Martha Stolze
Researcher, Weizenbaum Institute / MPhil Oxon
Platform algorithms, digital propaganda, gendered disinformation, strategic communication
Tomas Nagy
Weizenbaum Institute for the Networked Society, Berlin, Germany
Victoria Vziatysheva
Doctoral student, University of Bern