🤖 AI Summary
To address the poor scalability and limited real-time performance of evidence retrieval in automated fact-checking, this paper proposes a co-optimization framework that combines lightweight fact indexing with vector quantization (VQ)-based index compression. It is the first work to systematically investigate joint compression of concise fact indexes and dense retrieval over large-scale knowledge sources (e.g., Wikipedia), integrating fact extraction, inverted indexing, and VQ techniques to significantly reduce both storage and computational overhead. Evaluated on the HoVer and WiCE benchmarks and on real-world data from the 2024 U.S. presidential debate, the method achieves a 10.0× speedup on CPU and over 20.0× on GPU while maintaining competitive accuracy. The authors also release the first publicly available fact-checking dataset specifically designed for debate scenarios. This work bridges a critical gap in research on efficient, deployable evidence retrieval for real-time fact-checking systems.
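The inverted-indexing component the summary mentions can be illustrated with a minimal sketch: map each term to the ids of the indexed factual statements containing it, then score candidates by term overlap with the claim. This is an illustrative toy (the function names and the simple overlap scoring are assumptions, not the paper's implementation, which pairs the index with dense retrieval):

```python
from collections import defaultdict

def build_inverted_index(facts):
    """Map each lowercased term to the set of fact ids containing it."""
    index = defaultdict(set)
    for fid, fact in enumerate(facts):
        for term in fact.lower().split():
            index[term].add(fid)
    return index

def retrieve(index, query, facts, k=3):
    """Rank indexed facts by the number of query terms they share."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for fid in index.get(term, ()):
            scores[fid] += 1
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [facts[fid] for fid, _ in ranked[:k]]
```

In a full pipeline this first-stage lookup would narrow the candidate set before the verification component runs.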
📝 Abstract
Advances in digital tools have led to the rampant spread of misinformation. Fact-checking aims to combat this, but manual fact-checking is cumbersome and does not scale; for automated fact-checking to help combat misinformation in real time and at the source, it must be efficient. Fact-checking pipelines primarily comprise two components: a knowledge retrieval component, which extracts knowledge relevant to a claim from large sources such as Wikipedia, and a verification component. Existing work focuses primarily on fact verification rather than on evidence retrieval from large data collections, which often faces scalability issues in practical applications such as live fact-checking. In this study, we address this gap by exploring methods for indexing a succinct set of factual statements from large collections like Wikipedia to enhance the retrieval phase of the fact-checking pipeline. We also explore the impact of vector quantization on the efficiency of pipelines that employ dense retrieval for first-stage retrieval. We study the efficiency and effectiveness of these approaches on fact-checking datasets such as HoVer and WiCE, leveraging Wikipedia as the knowledge source. We further evaluate the real-world utility of the efficient retrieval approaches by fact-checking the 2024 presidential debate, and we open-source the collection of claims identified in the debate along with their labels. By combining indexed facts with dense retrieval and index compression, we achieve up to a 10.0x speedup on CPUs and more than a 20.0x speedup on GPUs compared to classical fact-checking pipelines over large collections.
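The vector-quantization step can be sketched as a product quantizer: each dense embedding is split into subspaces, each subspace is clustered with k-means, and a vector is stored as a few one-byte centroid codes instead of full floats; queries are scored via a precomputed lookup table. This is a minimal numpy sketch under assumed parameters (4 subspaces, 16 centroids), not the paper's actual quantization setup:

```python
import numpy as np

def train_pq(vectors, n_subspaces=4, n_centroids=16, n_iters=10, seed=0):
    """Train one k-means codebook per subspace of the embedding."""
    rng = np.random.default_rng(seed)
    sub_d = vectors.shape[1] // n_subspaces
    codebooks = []
    for s in range(n_subspaces):
        sub = vectors[:, s * sub_d:(s + 1) * sub_d]
        # Initialize centroids from a random sample of sub-vectors.
        centroids = sub[rng.choice(len(sub), n_centroids, replace=False)].copy()
        for _ in range(n_iters):
            dists = ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(1)
            for c in range(n_centroids):
                members = sub[assign == c]
                if len(members):
                    centroids[c] = members.mean(0)
        codebooks.append(centroids)
    return codebooks

def encode(vectors, codebooks):
    """Compress each vector to one uint8 centroid id per subspace."""
    sub_d = vectors.shape[1] // len(codebooks)
    codes = np.empty((len(vectors), len(codebooks)), dtype=np.uint8)
    for s, cb in enumerate(codebooks):
        sub = vectors[:, s * sub_d:(s + 1) * sub_d]
        codes[:, s] = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1).argmin(1)
    return codes

def search(query, codes, codebooks, k=5):
    """Approximate nearest neighbors via per-subspace distance tables."""
    n_sub = len(codebooks)
    sub_d = len(query) // n_sub
    tables = np.stack([
        ((query[s * sub_d:(s + 1) * sub_d][None, :] - cb) ** 2).sum(-1)
        for s, cb in enumerate(codebooks)
    ])  # shape: (n_subspaces, n_centroids)
    approx = tables[np.arange(n_sub), codes].sum(1)  # table lookups, no decode
    return np.argsort(approx)[:k]
```

The compressed codes occupy a small fraction of the original float storage, and search touches only small lookup tables rather than full vectors, which is where the reported CPU/GPU speedups over uncompressed dense retrieval come from.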