🤖 AI Summary
To address the challenges of modeling irregularly sampled, highly incomplete medical time series, this paper proposes the first sequence–image dual-path joint modeling framework: a Transformer branch processes the raw temporal sequences, while a parallel branch maps them into Gramian Angular Field (GAF) image representations. Three self-supervised tasks—masked reconstruction, cross-modal contrastive learning, and cross-reconstruction—enable synergistic optimization of features across the two modalities. The framework is evaluated under realistic missing-data scenarios (leave-sensors-out and leave-samples-out) and demonstrates strong robustness. Extensive experiments on three clinical datasets show consistent superiority over seven state-of-the-art methods, with average classification accuracy improvements of 3.2–5.8 percentage points, enhancing both model generalizability and clinical prediction reliability.
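As a rough sketch of how the image branch's input can be formed (the exact preprocessing—per-channel normalization, handling of unobserved values, image resolution—follows the paper's pipeline, not this toy version), a Gramian Angular Summation Field rescales a series to [-1, 1], encodes each value as an angle, and builds a pairwise cosine-sum matrix:

```python
import numpy as np

def gramian_angular_field(x, eps=1e-8):
    """Map a 1-D series to a GASF image: rescale to [-1, 1], take the
    angular encoding phi = arccos(x), then G[i, j] = cos(phi_i + phi_j)."""
    x = np.asarray(x, dtype=np.float64)
    x_min, x_max = x.min(), x.max()
    x_scaled = 2.0 * (x - x_min) / (x_max - x_min + eps) - 1.0
    x_scaled = np.clip(x_scaled, -1.0, 1.0)      # guard against rounding error
    phi = np.arccos(x_scaled)                    # polar angular coordinate
    return np.cos(phi[:, None] + phi[None, :])   # summation GAF (GASF)

# Illustrative only: one channel, already aligned to a regular grid
series = np.sin(np.linspace(0, 3 * np.pi, 64))
image = gramian_angular_field(series)            # shape (64, 64)
```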
📝 Abstract
Medical time series are often irregularly sampled and exhibit substantial missingness, posing challenges for data analysis and clinical decision-making. Existing methods typically adopt a single modeling perspective, either treating the data as raw sequences or transforming them into image representations for classification. In this paper, we propose a joint learning framework that incorporates both sequence and image representations. We also design three self-supervised learning strategies to facilitate the fusion of the two views and capture a more generalizable joint representation. The results indicate that our approach outperforms seven other state-of-the-art models on three representative real-world clinical datasets. We further validate our approach by simulating the two major types of real-world missingness through leave-sensors-out and leave-samples-out protocols. Under these settings, our approach remains more robust and significantly surpasses the baselines in classification performance.
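For intuition on the two evaluation settings (a minimal sketch; the drop ratios, zero-filling, and exact protocol here are assumptions, not the paper's specification), leave-sensors-out removes all observations from a random subset of channels, while leave-samples-out removes a random subset of individual observed entries:

```python
import numpy as np

def leave_sensors_out(x, mask, sensor_ratio=0.3, rng=None):
    """Drop entire sensors: x is (T, D) values, mask is (T, D) observation indicator."""
    rng = rng or np.random.default_rng()
    num_drop = int(round(sensor_ratio * x.shape[1]))
    dropped = rng.choice(x.shape[1], size=num_drop, replace=False)
    x, mask = x.copy(), mask.copy()
    x[:, dropped] = 0.0
    mask[:, dropped] = 0
    return x, mask

def leave_samples_out(x, mask, sample_ratio=0.3, rng=None):
    """Drop a random subset of observed entries across all sensors."""
    rng = rng or np.random.default_rng()
    observed = np.argwhere(mask > 0)
    idx = rng.choice(len(observed), size=int(round(sample_ratio * len(observed))), replace=False)
    x, mask = x.copy(), mask.copy()
    t, d = observed[idx, 0], observed[idx, 1]
    x[t, d] = 0.0
    mask[t, d] = 0
    return x, mask
```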