🤖 AI Summary
This study addresses methodological risks that arise when large language models (LLMs) are applied in the qualitative synthesis (QS) phase of systematic reviews, particularly bias amplification and diminished credibility of results. Using a collaborative autoethnographic approach, it combines analysis of LLM technical mechanisms, iterative empirical trials, and methodological critique to evaluate LLMs' text summarization capabilities and practical limitations. Findings show that inconsistent methodological reporting frequently leads to LLM misuse, while intrinsic properties of the models, including training-data biases and opaque reasoning processes, undermine the reproducibility and interpretability of synthesized outputs. Based on these insights, the study proposes three foundational principles: human-led oversight, model-assisted execution, and process transparency, highlighting the critical role of embedded supervision and auditable operational logging in safeguarding QS rigor. This work delivers the first reflexively grounded risk-identification framework and governance pathway for integrating LLMs into evidence-based research.
📝 Abstract
Large language models (LLMs) show promise for supporting systematic reviews (SRs), even for complex tasks such as qualitative synthesis (QS). However, applying them to a stage that is unevenly reported and variably conducted carries important risks: misuse can amplify existing weaknesses and erode confidence in SR findings. To examine the challenges of using LLMs for QS, we conducted a collaborative autoethnography involving two trials. We evaluated each trial for methodological rigor and practical usefulness, and interpreted the results through a technical lens informed by how LLMs are built and what their current limitations are.