🤖 AI Summary
Existing clinical outcome prediction methods fail to capture cross-patient, cross-modal temporal patterns (e.g., abnormal heart rate trends aligned with corresponding clinical text descriptions) from multimodal electronic health records (numerical time series + clinical notes), limiting predictive accuracy and interpretability.
Method: We propose the first cross-modal temporal pattern discovery framework: (1) Slot Attention to learn shared, patient-agnostic temporal representations; (2) Temporal Pattern-based Noise Contrastive Estimation (TPNCE) for fine-grained semantic alignment between modalities; and (3) joint optimization of dual reconstruction losses and multimodal temporal embeddings.
Results: Evaluated on MIMIC-III, our method significantly improves 48-hour in-hospital mortality prediction and 24-hour phenotype classification, achieving an average AUC gain of 3.2% over state-of-the-art baselines. It is the first approach to enable interpretable modeling and effective utilization of cross-patient, cross-modal critical temporal patterns.
📝 Abstract
Integrating multimodal Electronic Health Records (EHR) data, such as numerical time series and free-text clinical reports, has great potential in predicting clinical outcomes. However, prior work has primarily focused on capturing temporal interactions within individual samples and fusing multimodal information, overlooking critical temporal patterns across patients. These patterns, such as trends in vital signs like abnormal heart rate or blood pressure, can indicate deteriorating health or an impending critical event. Similarly, clinical notes often contain textual descriptions that reflect these patterns. Identifying corresponding temporal patterns across different modalities is crucial for improving the accuracy of clinical outcome predictions, yet it remains a challenging task. To address this gap, we introduce a Cross-Modal Temporal Pattern Discovery (CTPD) framework, designed to efficiently extract meaningful cross-modal temporal patterns from multimodal EHR data. Our approach introduces shared initial temporal pattern representations which are refined using slot attention to generate temporal semantic embeddings. To ensure rich cross-modal temporal semantics in the learned patterns, we introduce a contrastive-based TPNCE loss for cross-modal alignment, along with two reconstruction losses to retain core information of each modality. Evaluations on two clinically critical tasks, 48-hour in-hospital mortality and 24-hour phenotype classification, using the MIMIC-III database demonstrate the superiority of our method over existing approaches.