🤖 AI Summary
Clinicians’ trust in AI recommendations remains a critical barrier to deploying explainable AI (XAI) in sleep medicine, particularly for diagnosing nocturnal arousal events.
Method: We propose a novel “white-box AI as quality control (QC)” paradigm and conducted a multi-stage user study with eight clinical experts, comparing three modes: manual scoring, real-time black-box assistance, and post-hoc white-box QC.
Contribution/Results: This is the first empirical demonstration that both explanation depth and timing of XAI intervention jointly determine human-AI collaboration efficacy. The white-box QC mode improved event-level diagnostic accuracy by ~30%, enhanced count-level consistency, and reduced inter-rater variability. Most experts preferred transparent systems and affirmed their clinical utility. Critically, structured XAI integration outperformed individual expert performance, providing key empirical evidence for trustworthy clinical AI deployment.
📝 Abstract
Artificial intelligence (AI) systems increasingly match or surpass human experts in biomedical signal interpretation. However, their effective integration into clinical practice requires more than high predictive accuracy: clinicians must discern *when* and *why* to trust algorithmic recommendations. This work presents an application-grounded user study with eight professional sleep medicine practitioners, who score nocturnal arousal events in polysomnographic data under three conditions: (i) manual scoring, (ii) black-box (BB) AI assistance, and (iii) transparent white-box (WB) AI assistance. Assistance is provided either from the *start* of scoring or as a post-hoc quality-control (*QC*) review. We systematically evaluate how the type and timing of assistance influence event-level performance and the clinically most relevant count-based performance, as well as time requirements and user experience. When evaluated against the clinical standard used to train the AI, both AI and human-AI teams significantly outperform unaided experts, with collaboration also reducing inter-rater variability. Notably, transparent AI assistance applied as a targeted QC step yields median event-level performance improvements of approximately 30% over black-box assistance, and QC timing further enhances count-based outcomes. While WB and QC approaches increase the time required for scoring, start-time assistance is faster and preferred by most participants. Participants overwhelmingly favor transparency, with seven out of eight expressing willingness to adopt the system with minor or no modifications. In summary, strategically timed transparent AI assistance effectively balances accuracy and clinical efficiency, providing a promising pathway toward trustworthy AI integration and user acceptance in clinical workflows.