🤖 AI Summary
To address low drafting efficiency, weak clinical reasoning, and insufficient regulatory compliance in clinical trial documents (e.g., protocols), this study proposes a synergistic framework integrating Retrieval-Augmented Generation (RAG) with commercial large language models (LLMs). Methodologically, we construct a knowledge base unifying structured data from ClinicalTrials.gov and authoritative regulatory guidelines (e.g., ICH-GCP), enabling precise semantic retrieval and controllable generation. Our key contribution is the first systematic validation that RAG significantly enhances LLM performance on critical dimensions—clinical reasoning and reference transparency—where two core metrics improve from ≈40% to ≈80%, while content relevance and terminology accuracy remain consistently above 80%. This framework overcomes the applicability limitations of purely generative models in high-stakes, rigor-critical medical documentation, substantially improving both the usability and regulatory compliance of protocol drafts.
📝 Abstract
BACKGROUND/AIMS
Clinical trials require numerous documents to be written: Protocols, consent forms, clinical study reports, and many others. Large language models offer the potential to rapidly generate first-draft versions of these documents; however, there are concerns about the quality of their output. Here, we report an evaluation of how good large language models are at generating sections of one such document, clinical trial protocols.
METHODS
Using an off-the-shelf large language model, we generated protocol sections for a broad range of diseases and clinical trial phases. Each of these document sections we assessed across four dimensions: Clinical thinking and logic; Transparency and references; Medical and clinical terminology; and Content relevance and suitability. To improve performance, we used the retrieval-augmented generation method to enhance the large language model with accurate up-to-date information, including regulatory guidance documents and data from ClinicalTrials.gov. Using this retrieval-augmented generation large language model, we regenerated the same protocol sections and assessed them across the same four dimensions.
RESULTS
We find that the off-the-shelf large language model delivers reasonable results, especially when assessing content relevance and the correct use of medical and clinical terminology, with scores of over 80%. However, the off-the-shelf large language model shows limited performance in clinical thinking and logic and transparency and references, with assessment scores of ≈40% or less. The use of retrieval-augmented generation substantially improves the writing quality of the large language model, with clinical thinking and logic and transparency and references scores increasing to ≈80%. The retrieval-augmented generation method thus greatly improves the practical usability of large language models for clinical trial-related writing.
DISCUSSION
Our results suggest that hybrid large language model architectures, such as the retrieval-augmented generation method we utilized, offer strong potential for clinical trial-related writing, including a wide variety of documents. This is potentially transformative, since it addresses several major bottlenecks of drug development.