🤖 AI Summary
This study addresses critical challenges in radiotherapy treatment plan evaluation: poor protocol adherence, limited interpretability, and the high hallucination risk of large language models (LLMs). To this end, we propose the first retrieval-augmented generation (RAG) framework tailored to multi-disease, multi-protocol clinical settings. Methodologically, we build a modular, tool-augmented reasoning system atop LLaMA-4 109B, integrating SentenceTransformer-based semantic retrieval, percentile prediction driven by cohort similarity, clinical constraint validation, and a multi-step prompting chain. Crucially, we introduce population-level benchmark scoring into the inference pipeline, enabling evaluations that are traceable, verifiable, and robust across protocols. Experiments demonstrate <2% retrieval error, 100% nearest-neighbor accuracy within a 5-percentile-point margin, and perfect consistency between the end-to-end assessment and independent module outputs, substantially enhancing transparency and clinical trustworthiness.
📝 Abstract
Purpose: To develop a retrieval-augmented generation (RAG) system powered by LLaMA-4 109B for automated, protocol-aware, and interpretable evaluation of radiotherapy treatment plans.
Methods and Materials: We curated a multi-protocol dataset of 614 radiotherapy plans across four disease sites and constructed a knowledge base containing normalized dose metrics and protocol-defined constraints. The RAG system integrates three core modules: a retrieval engine optimized across five SentenceTransformer backbones, a percentile prediction component based on cohort similarity, and a clinical constraint checker. These tools are directed by a large language model (LLM) using a multi-step prompt-driven reasoning pipeline to produce concise, grounded evaluations.
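The cohort-similarity percentile step described above can be sketched in a few lines. This is a toy illustration only: the vectors, percentile values, and helper names (`cosine`, `predict_percentile`, `cohort`) are hypothetical, and the actual system embeds plans with SentenceTransformer backbones such as all-MiniLM-L6-v2 rather than the hand-made features shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy cohort: (plan embedding, population percentile of that plan's quality score)
cohort = [
    ([0.9, 0.1, 0.0], 80.0),
    ([0.1, 0.9, 0.2], 35.0),
    ([0.8, 0.2, 0.1], 75.0),
]

def predict_percentile(query_vec, cohort, k=2):
    """Predict a new plan's percentile as the mean percentile of its
    k most similar cohort plans (nearest neighbors by cosine similarity)."""
    ranked = sorted(cohort, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    top = ranked[:k]
    return sum(p for _, p in top) / len(top)

print(predict_percentile([0.85, 0.15, 0.05], cohort))  # → 77.5
```

In the paper's setting, the cohort would hold the 614 curated plans with their normalized dose metrics, and the retrieved neighbors would also ground the LLM's narrative evaluation.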
Results: Retrieval hyperparameters were optimized with Gaussian Process optimization over a scalarized loss function combining root mean squared error (RMSE), mean absolute error (MAE), and clinically motivated accuracy thresholds. The best configuration, based on all-MiniLM-L6-v2, achieved perfect nearest-neighbor accuracy within a 5-percentile-point margin and an MAE below 2 percentile points. When tested end-to-end, the RAG system achieved 100% agreement with the values computed by the standalone retrieval and constraint-checking modules on both percentile estimates and constraint identification, confirming reliable execution of all retrieval, prediction, and checking steps.
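A minimal sketch of what a scalarized loss of this shape might look like. The weights, the 5-point tolerance, and the sample data below are illustrative assumptions, not the paper's actual configuration; the real objective was minimized by a Gaussian Process optimizer over retrieval hyperparameters.

```python
import math

def scalarized_loss(y_true, y_pred, tol=5.0, w_rmse=1.0, w_mae=1.0, w_acc=1.0):
    """Collapse RMSE, MAE, and (1 - accuracy-within-tolerance) into a single
    scalar suitable for minimization by a black-box optimizer.
    Weights and tolerance are illustrative placeholders."""
    errs = [p - t for p, t in zip(y_pred, y_true)]
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    mae = sum(abs(e) for e in errs) / len(errs)
    # Fraction of predictions landing within tol percentile points of the truth
    acc = sum(1 for e in errs if abs(e) <= tol) / len(errs)
    return w_rmse * rmse + w_mae * mae + w_acc * (1.0 - acc)

# Example: predicted vs. true percentiles for four plans
print(scalarized_loss([50, 60, 70, 80], [52, 58, 76, 81]))
```

Scalarizing lets a single Gaussian Process model trade off average error (RMSE, MAE) against the clinically motivated hit rate, rather than running a full multi-objective search.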
Conclusion: Our findings highlight the feasibility of combining structured population-based scoring with modular tool-augmented reasoning for transparent, scalable plan evaluation in radiation therapy. The system offers traceable outputs, minimizes hallucination, and demonstrates robustness across protocols. Future directions include clinician-led validation and domain-adapted retrieval models to enhance real-world integration.