AI Summary
This study addresses the unstable, poorly calibrated outputs that general-purpose multimodal models produce when directly prompted to screen for obstructive sleep apnea-hypopnea syndrome (OSAHS). To overcome these limitations, the authors propose a two-stage framework. First, each frontal facial image is decomposed into seven fixed anatomical queries, and a multimodal foundation model turns its answers into structured visual evidence cards that record the target anatomy, visibility, and risk direction. Second, these evidence cards are fused with clinical information so that a large language model can make an interpretable binary screening decision. Evaluated on 642 subjects, the method achieves 88.47% accuracy, 94.86% sensitivity, a 93.74% F1 score, and a 5.14% false-negative rate. It outperforms clinical-only prompting, direct multimodal prompting, and naive two-stage pipelines while keeping the screening evidence auditable and traceable.
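The separation between the two stages can be sketched in Python as a typed evidence card plus a text-only prompt for the final adjudicator. This is a minimal illustration, not the authors' implementation. The following are all hypothetical: the JSON output format, the label vocabularies for visibility, risk direction, and evidence strength, and the names `EvidenceCard`, `parse_card`, and `build_adjudication_prompt`. Only the seven anatomical targets and the six card fields come from the abstract.

```python
import json
from dataclasses import dataclass

# The seven fixed anatomical queries listed in the abstract.
ANATOMICAL_QUERIES = (
    "neck", "chin", "mouth", "face/neck fat", "lower jaw", "midface", "nose",
)

# Hypothetical label vocabularies: the categories actually used by the
# authors are not specified in the summary or abstract.
ALLOWED = {
    "visibility": {"high", "medium", "low"},
    "risk_direction": {"increased", "neutral", "decreased"},
    "evidence_strength": {"strong", "moderate", "weak"},
}


@dataclass(frozen=True)
class EvidenceCard:
    """One structured card per (image, anatomical query) pair."""
    target_anatomy: str
    visibility: str
    risk_direction: str
    evidence_strength: str
    confidence: float  # assumed to be a score in [0, 1]
    summary: str


def parse_card(raw: str) -> EvidenceCard:
    """Stage 1 output check: parse one visual response (assumed to be JSON).

    Raises ValueError on malformed JSON, missing fields, or values outside
    the allowed vocabularies.
    """
    try:
        data = json.loads(raw)  # json.JSONDecodeError is a ValueError
        card = EvidenceCard(
            target_anatomy=data["target_anatomy"],
            visibility=data["visibility"],
            risk_direction=data["risk_direction"],
            evidence_strength=data["evidence_strength"],
            confidence=float(data["confidence"]),
            summary=str(data["summary"]).strip(),
        )
    except (KeyError, TypeError) as exc:
        raise ValueError(f"malformed evidence card: {exc}") from exc
    if card.target_anatomy not in ANATOMICAL_QUERIES:
        raise ValueError(f"unknown anatomy: {card.target_anatomy!r}")
    for field, allowed in ALLOWED.items():
        if getattr(card, field) not in allowed:
            raise ValueError(f"invalid {field}: {getattr(card, field)!r}")
    if not 0.0 <= card.confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return card


def build_adjudication_prompt(cards, clinical_profile):
    """Stage 2 input: clinical data plus image-derived cards, as text only.

    The image itself is not passed on; the final model sees only the cards.
    """
    lines = ["Clinical profile:"]
    lines += [f"- {key}: {value}" for key, value in clinical_profile.items()]
    lines.append("Visual evidence cards:")
    for c in cards:
        lines.append(
            f"- [{c.target_anatomy}] visibility={c.visibility}, "
            f"risk={c.risk_direction}, strength={c.evidence_strength}, "
            f"confidence={c.confidence:.2f}: {c.summary}"
        )
    lines.append("Answer 'positive' or 'negative' for OSAHS screening.")
    return "\n".join(lines)


if __name__ == "__main__":
    raw = ('{"target_anatomy": "neck", "visibility": "high", '
           '"risk_direction": "increased", "evidence_strength": "moderate", '
           '"confidence": 0.8, "summary": "Short, wide neck."}')
    card = parse_card(raw)
    # "age" and "BMI" are illustrative clinical fields, not the paper's schema.
    print(build_adjudication_prompt([card], {"age": 52, "BMI": 31.2}))
```

Validating each card before stage 2 is one way to make a structured-parse rate measurable. Under this sketch, any response that raises `ValueError` would count as a parse failure.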
Abstract
Effective pre-polysomnography screening for obstructive sleep apnea-hypopnea syndrome (OSAHS) requires combining clinical risk factors with visible craniofacial and neck cues. Directly prompting general-purpose multimodal foundation models for medical yes/no decisions can yield unstable, poorly calibrated outputs. We propose EviOSAHS, an evidence-grounded multimodal reasoning framework that separates image-only anatomical evidence acquisition from final clinical adjudication. Each frontal facial image is decomposed into seven fixed anatomical queries covering the neck, chin, mouth, face/neck fat, lower jaw, midface, and nose. The visual responses are converted into structured evidence cards that record the target anatomy, visibility, risk direction, evidence strength, confidence, and a concise summary. These cards are combined with a cleaned clinical profile only in the final stage, where a large language model performs balanced binary screening adjudication. We evaluated EviOSAHS on a 642-subject cohort, mapping normal subjects to screening-negative and subjects with mild, moderate, or severe OSAHS to screening-positive. Under a unified protocol, EviOSAHS achieved 88.47% accuracy, 94.86% sensitivity, a 93.74% F1-score, and a 5.14% false-negative rate, outperforming clinical-only prompting, direct multimodal prompting, and naive two-stage pipelines. Ablations showed that the seven-question visual decomposition and the balanced final adjudication were critical to reaching this high-sensitivity operating point. A question-level audit of all 4,494 visual outputs (642 subjects × 7 queries) showed a 100% structured-parse rate and a 93.88% high-visibility rate. EviOSAHS provides an auditable, high-sensitivity workflow for binary pre-polysomnography OSAHS screening, but it should be viewed as a triage assistant rather than a diagnostic system. Prospective validation, external testing, and calibrated operating-point control are needed before clinical deployment.
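The binary label mapping, the reported metrics, and the question-level audit rates can be made concrete with a short Python sketch. It assumes predictions have already been reduced to 0/1 labels. The function names `to_binary`, `screening_metrics`, and `audit_rates` are hypothetical, and the toy data is illustrative only, not the study's cohort. The false-negative rate is the complement of sensitivity, which is why the reported 94.86% and 5.14% sum to 100%.

```python
# Severity-to-screening mapping described in the abstract:
# normal -> screening-negative (0); mild/moderate/severe -> screening-positive (1).
SCREEN_LABEL = {"normal": 0, "mild": 1, "moderate": 1, "severe": 1}


def to_binary(severities):
    """Map severity grades to binary screening labels."""
    return [SCREEN_LABEL[s] for s in severities]


def _ratio(num, den):
    """Safe division for rates; returns 0.0 when the denominator is 0."""
    return num / den if den else 0.0


def screening_metrics(y_true, y_pred):
    """Accuracy, sensitivity, false-negative rate, and F1 for the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    sensitivity = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        # Equals 1 - sensitivity whenever there is at least one positive.
        "false_negative_rate": _ratio(fn, tp + fn),
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }


def audit_rates(visibilities):
    """Question-level audit: one entry per (subject, query) visual output.

    Each entry is the parsed visibility label, or None if the output failed
    to parse. Assumption: the high-visibility rate uses parsed outputs as
    its denominator (with a 100% parse rate the choice makes no difference).
    """
    parsed = [v for v in visibilities if v is not None]
    return {
        "parse_rate": _ratio(len(parsed), len(visibilities)),
        "high_visibility_rate": _ratio(sum(v == "high" for v in parsed), len(parsed)),
    }


if __name__ == "__main__":
    # Toy data for illustration only.
    y_true = to_binary(["normal", "mild", "severe", "moderate", "normal"])
    y_pred = [1, 1, 1, 0, 0]
    print(screening_metrics(y_true, y_pred))
    print(audit_rates(["high", "high", "low", None]))
```

Sensitivity and false-negative rate depend only on the screening-positive subjects. That is why a pipeline tuned for triage can report a high sensitivity alongside a noticeably lower accuracy.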