AI Summary
Existing multimodal clinical prediction approaches rely on static fusion strategies that do not fully exploit modality-specific representations from heterogeneous data such as electronic health records and biosignals. To address this limitation, this work proposes LeMoF, a framework that uses a level-guided mechanism to select and fuse representations taken from different encoder layers within each modality, while learning both global modality-level predictions and level-specific discriminative features. By replacing static fusion with this level-wise selection and training the two objectives jointly, LeMoF aims to improve both robustness and discriminative capacity in heterogeneous clinical settings. Experiments on ICU length-of-stay prediction show that LeMoF consistently outperforms state-of-the-art fusion methods across diverse encoder configurations, and indicate that level-wise fusion is a key contributor to that performance.
Abstract
Multimodal clinical prediction is widely used to integrate heterogeneous data such as Electronic Health Records (EHR) and biosignals. However, existing methods tend to rely on static modality integration schemes and simple fusion strategies. As a result, they fail to fully exploit modality-specific representations. In this paper, we propose Level-guided Modal Fusion (LeMoF), a novel framework that selectively integrates level-guided representations within each modality. Each level refers to a representation extracted from a different layer of the encoder. LeMoF explicitly separates global modality-level predictions from level-specific discriminative representations and learns both. This design enables LeMoF to balance prediction stability and discriminative capability, even in heterogeneous clinical environments. Experiments on length-of-stay prediction using Intensive Care Unit (ICU) data demonstrate that LeMoF consistently outperforms existing state-of-the-art multimodal fusion techniques across various encoder configurations. We also confirm that level-wise integration is a key factor in achieving robust predictive performance across various clinical conditions.
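To make the notion of level-wise selection concrete, the following is a minimal sketch of one way representations from several encoder layers of a single modality could be weighted and combined. The abstract does not specify the selection mechanism, so the softmax gating, the guidance vector `guide`, and the helper names `softmax`, `dot`, and `fuse_levels` are all illustrative assumptions rather than the LeMoF implementation.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_levels(
    levels: Sequence[Sequence[float]], guide: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Weight each level by its similarity to a guidance vector, then sum.

    Assumption: selection is modeled as softmax gating. This is a stand-in
    for whatever mechanism the paper actually uses.
    """
    weights = softmax([dot(level, guide) for level in levels])
    dim = len(levels[0])
    fused = [
        sum(w * level[i] for w, level in zip(weights, levels))
        for i in range(dim)
    ]
    return fused, weights


# Toy input: three levels (e.g. shallow, middle, deep layers) of one modality.
levels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
guide = [0.0, 2.0]  # hypothetical guidance vector
fused, weights = fuse_levels(levels, guide)
```

In this toy setting the second and third levels align equally with the guidance vector and therefore receive equal weight, while the first level is down-weighted. A static fusion scheme, by contrast, would fix these weights in advance regardless of the input.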