🤖 AI Summary
Deep learning models in medical AI devices frequently generate “hallucinations”—outputs that appear plausible yet can mislead clinical decision-making—but these phenomena lack a rigorous, standardized definition and systematic evaluation. Method: We propose the first unified, cross-modal and cross-task definition of hallucination in medical devices: data-driven model outputs that exhibit superficial plausibility while potentially compromising clinical judgment; we further distinguish *impactful* from *harmless* hallucinations. Through dual-track empirical analysis—spanning imaging tasks (e.g., lesion segmentation) and non-imaging tasks (e.g., physiological parameter prediction)—we develop an integrated theoretical–empirical assessment framework. Contribution/Results: We introduce the first systematic, multi-product-line hallucination taxonomy and quantitative evaluation paradigm; characterize the hallucination spectrum across diverse clinical scenarios; and deliver actionable detection protocols and mitigation strategies, providing a methodological foundation for regulatory review and safety governance of medical AI.
📝 Abstract
Computer methods in medical devices are frequently imperfect and are known to produce errors in clinical or diagnostic tasks. However, when deep learning and other data-driven approaches yield outputs that exhibit errors, the devices are frequently said to hallucinate. Drawing from theoretical developments and empirical studies in multiple medical device areas, we introduce a practical and universal definition that characterizes hallucinations as a type of error that is plausible and can be either impactful or benign to the task at hand. The definition aims to facilitate the evaluation of medical devices that suffer from hallucinations across product areas. Using examples from imaging and non-imaging applications, we explore how the proposed definition relates to evaluation methodologies and discuss existing approaches for minimizing the prevalence of hallucinations.
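To make the distinction between errors, benign hallucinations, and impactful hallucinations concrete, the sketch below shows one possible way to operationalize the definition for a lesion-segmentation output. It is only an illustration under stated assumptions: the `classify_output` function, the `plausibility_score` input, the Dice-based impact proxy, and all thresholds are hypothetical choices, not a protocol prescribed by the paper.

```python
import numpy as np

def dice(a, b, eps=1e-8):
    """Dice overlap between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def classify_output(pred, reference, plausibility_score,
                    error_thresh=0.9, plausibility_thresh=0.7, impact_thresh=0.5):
    """Illustrative classification of one model output.

    - An output is an *error* if it deviates from the reference (low Dice).
    - An error is a *hallucination* if it still looks plausible (high plausibility proxy).
    - A hallucination is *impactful* if the deviation is large enough to plausibly
      change the downstream clinical read (here crudely proxied by a lower Dice cutoff).
    """
    overlap = dice(pred, reference)
    if overlap >= error_thresh:
        return "correct"
    if plausibility_score < plausibility_thresh:
        return "implausible error"  # visibly wrong output, not a hallucination
    return "impactful hallucination" if overlap < impact_thresh else "benign hallucination"

if __name__ == "__main__":
    reference = np.zeros((64, 64), dtype=bool)
    reference[20:40, 20:40] = True            # ground-truth lesion mask

    # A prediction that misses most of the lesion but draws a clean, lesion-like shape.
    pred = np.zeros_like(reference)
    pred[35:55, 35:55] = True

    # Plausibility proxy: in practice this could come from reader assessment or an
    # anatomical-consistency model; here it is simply a hand-set number.
    print(classify_output(pred, reference, plausibility_score=0.85))
```

Running the example prints "impactful hallucination": the mask barely overlaps the true lesion yet is rated plausible, which is exactly the failure mode the proposed definition is meant to single out.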