🤖 AI Summary
This work formalizes the previously unaddressed task of corpus-level inconsistency detection in Wikipedia. To tackle it, the authors propose CLAIRE, an agentic system that combines large language model (LLM) reasoning with retrieval to surface potentially contradictory factual claims, together with contextual evidence, for human review. In a user study with experienced Wikipedia editors, 87.5% of participants reported higher confidence when using CLAIRE, and editors identified 64.7% more inconsistencies in the same amount of time. Combining CLAIRE with human annotation, the authors contribute WIKICOLLIDE, the first benchmark of real-world Wikipedia inconsistencies; the best fully automated system attains an AUROC of only 75.1% on it. Their analysis further shows that at least 3.3% of factual claims in English Wikipedia contradict another fact, with inconsistencies propagating into downstream datasets such as FEVEROUS and AmbigQA. Together, these results provide empirical evidence of Wikipedia's contradiction rate and a practical framework for assessing and improving knowledge base consistency at scale.
📝 Abstract
Wikipedia is the largest open knowledge corpus, used by people worldwide and serving as a key resource for training large language models (LLMs) and retrieval-augmented generation (RAG) systems. Ensuring its accuracy is therefore critical. But how accurate is Wikipedia, and how can we improve it?
We focus on inconsistencies, a specific type of factual inaccuracy, and introduce the task of corpus-level inconsistency detection. We present CLAIRE, an agentic system that combines LLM reasoning with retrieval to surface potentially inconsistent claims along with contextual evidence for human review. In a user study with experienced Wikipedia editors, 87.5% reported higher confidence when using CLAIRE, and participants identified 64.7% more inconsistencies in the same amount of time.
Combining CLAIRE with human annotation, we contribute WIKICOLLIDE, the first benchmark of real Wikipedia inconsistencies. Using random sampling with CLAIRE-assisted analysis, we find that at least 3.3% of English Wikipedia facts contradict another fact, with inconsistencies propagating into 7.3% of FEVEROUS and 4.0% of AmbigQA examples. Benchmarking strong baselines on this dataset reveals substantial headroom: the best fully automated system achieves an AUROC of only 75.1%.
Our results show that contradictions are a measurable component of Wikipedia and that LLM-based systems like CLAIRE can provide a practical tool to help editors improve knowledge consistency at scale.