🤖 AI Summary
Keyword-based retrieval in systematic literature reviews (SLRs) incurs high manual screening effort and low precision. Method: This paper proposes a semi-automated screening framework driven by consensus among multiple large language models (LLMs). It integrates state-of-the-art open-source and commercial LLMs (2024–2025), uses descriptive prompts for paper classification, generates initial labels via a weighted consensus mechanism, and incorporates human-in-the-loop supervision with real-time correction. A visual interactive tool, LLMSurver, enables human-AI collaborative decision-making. Results: Evaluated on over 8,000 real candidate papers, the framework substantially reduces manual screening workload, achieves lower error rates than individual human experts, and shows that modern open-source LLMs are sufficient for the task, combining high accuracy, strong interpretability, low cost, and broad applicability.
📝 Abstract
The creation of systematic literature reviews (SLRs) is critical for analyzing the landscape of a research field and guiding future research directions. However, retrieving and filtering the literature corpus for an SLR is highly time-consuming and requires extensive manual effort, as keyword-based searches in digital libraries often return numerous irrelevant publications. In this work, we propose a pipeline leveraging multiple large language models (LLMs), classifying papers based on descriptive prompts and deciding jointly using a consensus scheme. The entire process is human-supervised and interactively controlled via our open-source visual analytics web interface, LLMSurver, which enables real-time inspection and modification of model outputs. We evaluate our approach using ground-truth data from a recent SLR comprising over 8,000 candidate papers, benchmarking both open-source and commercial state-of-the-art LLMs from mid-2024 and fall 2025. Results demonstrate that our pipeline significantly reduces manual effort while achieving lower error rates than single human annotators. Furthermore, modern open-source models prove sufficient for this task, making the method accessible and cost-effective. Overall, our work demonstrates how responsible human-AI collaboration can accelerate and enhance systematic literature reviews within academic workflows.
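To make the consensus step concrete, the following is a minimal sketch of a weighted vote over per-model include/exclude labels. The model names, weights, threshold, and the "review" fallback are illustrative assumptions, not details taken from the paper:

```python
# Hypothetical sketch: combine per-model screening labels via a weighted vote.
from collections import defaultdict

def weighted_consensus(votes, weights, threshold=0.5):
    """Merge per-model labels ("include"/"exclude") into one decision.

    votes:   dict mapping model name -> label
    weights: dict mapping model name -> reliability weight
    Returns the label whose weight share exceeds `threshold`,
    or "review" to flag the paper for human inspection.
    """
    totals = defaultdict(float)
    for model, label in votes.items():
        totals[label] += weights.get(model, 1.0)  # default weight 1.0
    total_weight = sum(totals.values())
    label, score = max(totals.items(), key=lambda kv: kv[1])
    if score / total_weight > threshold:
        return label
    return "review"  # no clear majority: defer to the human in the loop

# Example: two of three (hypothetical) models vote to include the paper.
votes = {"model_a": "include", "model_b": "exclude", "model_c": "include"}
weights = {"model_a": 1.0, "model_b": 0.8, "model_c": 1.2}
print(weighted_consensus(votes, weights))  # -> include
```

Routing low-margin papers to "review" is one way such a scheme can keep a human in the loop, matching the supervised, real-time correction workflow the abstract describes.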