🤖 AI Summary
This study addresses the risk that a degraded AI model, for example one that has collapsed to random guessing or naive prediction, can compromise the reliability of treatment effect estimation in clinical trials. To mitigate this risk, we propose the "AI as a supporting reader" (AI-SR) human-AI collaboration framework, which embeds AI in the radiographic assessment workflow as an assistive component rather than a replacement for human readers, so that human judgment still governs the endpoint when the AI fails. In two randomized controlled trials with endpoints derived from spinal X-ray images, AI-SR is compared with human-only assessment and with an alternative AI framework on cost, accuracy, robustness, and generalization ability. AI-SR meets all of these criteria across model types, including deliberately degraded models. Even when AI performance falls to near-random levels, AI-SR preserves the trials' treatment effect estimates and conclusions, and it retains these advantages when applied to different populations.
📝 Abstract
Artificial intelligence (AI) holds great promise for supporting clinical trials, from patient recruitment and endpoint assessment to treatment response prediction. However, deploying AI without safeguards poses significant risks, particularly when it is used to evaluate patient endpoints that directly determine trial conclusions. We compared two AI frameworks against human-only assessment for medical image-based disease evaluation, measuring cost, accuracy, robustness, and generalization ability. To stress-test these frameworks, we injected deliberately bad models, ranging from random guesses to naive predictions, and checked whether the observed treatment effects remained valid even under severe model degradation. We evaluated the frameworks using two randomized controlled trials with endpoints derived from spinal X-ray images. Our findings indicate that using AI as a supporting reader (AI-SR) is the most suitable approach for clinical trials, as it meets all criteria across model types, even with bad models. This method consistently provides reliable disease estimation, preserves clinical trial treatment effect estimates and conclusions, and retains these advantages when applied to different populations.
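The stress test described in the abstract can be illustrated with a toy simulation. This is a minimal sketch, not the authors' method: the abstract does not say how AI-SR combines AI and human reads, so the sketch assumes that a primary human reader's call is accepted when the AI agrees and that a second human reader arbitrates when it does not. The function names (`read`, `bad_ai`, `simulate`), the progression rates, the reader accuracy, and the sample size are all hypothetical values chosen for illustration.

```python
import random


def read(truth, accuracy, rng):
    """Binary call from a human reader who is correct with probability `accuracy`."""
    return truth if rng.random() < accuracy else 1 - truth


def bad_ai(kind, rng):
    """Deliberately degraded model that ignores the image entirely."""
    if kind == "random":
        return int(rng.random() < 0.5)  # coin flip
    if kind == "naive":
        return 0  # always predicts "no progression"
    raise ValueError(f"unknown model kind: {kind}")


def simulate(kind, n_per_arm=20000, p_control=0.5, p_treated=0.3,
             human_acc=0.9, seed=0):
    """Return the estimated treatment effect (treated rate minus control rate)
    under three hypothetical endpoint-assessment workflows."""
    rng = random.Random(seed)
    rates = {}
    for arm, p in (("control", p_control), ("treated", p_treated)):
        counts = {"human_only": 0, "ai_only": 0, "ai_sr": 0}
        for _ in range(n_per_arm):
            truth = int(rng.random() < p)        # true progression status
            human1 = read(truth, human_acc, rng)  # primary human reader
            human2 = read(truth, human_acc, rng)  # arbitrating human reader
            ai = bad_ai(kind, rng)
            counts["human_only"] += human1
            counts["ai_only"] += ai
            # ASSUMED AI-SR rule: accept the primary read when the AI agrees,
            # otherwise defer to the second human reader.
            counts["ai_sr"] += human1 if ai == human1 else human2
        rates[arm] = {k: v / n_per_arm for k, v in counts.items()}
    return {k: rates["treated"][k] - rates["control"][k]
            for k in rates["control"]}


if __name__ == "__main__":
    for kind in ("random", "naive"):
        print(kind, simulate(kind))
```

Under this assumed rule the recorded endpoint is always a human read, so the AI-SR effect estimate tracks the human-only estimate even when the model is useless, while the AI-only workflow loses the treatment effect. A real evaluation would replace the simulated reads with actual reader and model outputs.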