🤖 AI Summary
Electronic health records for rare diseases are typically high-dimensional, sparse, and limited in sample size, which makes it hard to learn effective low-dimensional embeddings for clinical concepts and patients. This work proposes an unsupervised spectral representation learning framework that borrows information from a knowledge matrix drawn from a broader population while relaxing the conventional one-to-one signal-alignment assumption between that matrix and the cohort data. A two-stage spectral embedding strategy first denoises the knowledge matrix by removing irrelevant components, then applies a projection-based decomposition to recover shared and disease-specific components separately, enabling knowledge transfer within a partially overlapping subspace. Experiments on simulated data and a real-world multiple sclerosis cohort show that the method outperforms competing approaches, particularly when shared signals are weak and only partially aligned.
📝 Abstract
We propose a spectral-based, unsupervised representation learning framework to derive low-dimensional embeddings for clinical concepts and patients in rare disease cohorts from electronic health records, where data are high-dimensional but sample sizes are limited. To overcome this challenge, we incorporate a knowledge matrix extracted from a broader population that shares a partially overlapping subspace with the rare-disease cohort. Our method departs from existing approaches by relaxing restrictive one-to-one signal-alignment assumptions between the latent data matrix and knowledge matrix, allowing more flexible and realistic forms of structured sharing. We introduce a novel two-step spectral embedding procedure: first, we identify and remove irrelevant components from the knowledge matrix; then, we apply a projection-based method to separately recover shared and heterogeneous components. Simulations and an analysis of a real-world multiple sclerosis cohort show that the proposed method outperforms competing approaches, particularly in challenging scenarios where shared signals are weak and only partially aligned, as is common in rare-disease data.
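The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a two-step procedure of this shape could look like, not the authors' method. It assumes symmetric concept-by-concept matrices `X` (rare-disease cohort) and `K` (knowledge matrix), a hypothetical screening rule that keeps a knowledge direction `u` only when `|u^T X u|` exceeds a threshold `tau`, and user-supplied ranks `k_knowledge` and `r_specific`. All function names, parameters, and the synthetic data are illustrative assumptions.

```python
import numpy as np


def top_eigvecs(M, k):
    """Eigenpairs of symmetric M with the k largest |eigenvalues|."""
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(-np.abs(vals))[:k]
    return vals[order], vecs[:, order]


def two_step_embedding(X, K, k_knowledge, r_specific, tau):
    """Hypothetical two-step spectral embedding (illustration only).

    X: symmetric p x p concept matrix from the rare-disease cohort.
    K: symmetric p x p knowledge matrix from the broader population.
    """
    # Step 1: take the leading directions of K and keep only those that
    # also carry signal in X, scored by |u^T X u| (assumed screening rule).
    _, U = top_eigvecs(K, k_knowledge)
    scores = np.abs(np.einsum("ij,ik,kj->j", U, X, U))
    U_shared = U[:, scores > tau]

    # Step 2: split X by projecting onto the retained subspace (P) and its
    # orthogonal complement (Q), then take the leading directions of the
    # complement part as the cohort-specific component.
    P = U_shared @ U_shared.T
    Q = np.eye(X.shape[0]) - P
    _, V_specific = top_eigvecs(Q @ X @ Q, r_specific)
    return U_shared, V_specific


# Synthetic check: directions 0 and 1 are shared, direction 2 appears only
# in the knowledge matrix, and direction 3 only in the cohort matrix.
rng = np.random.default_rng(0)
p = 30
B, _ = np.linalg.qr(rng.standard_normal((p, p)))  # orthonormal basis


def rank_one_sum(weights):
    """Sum of w * b_j b_j^T over the given {column index: weight} map."""
    return sum(w * np.outer(B[:, j], B[:, j]) for j, w in weights.items())


def sym_noise(scale):
    """Small symmetric Gaussian perturbation."""
    G = rng.standard_normal((p, p))
    return scale * (G + G.T) / 2


K = rank_one_sum({0: 10.0, 1: 8.0, 2: 6.0}) + sym_noise(0.01)
X = rank_one_sum({0: 5.0, 1: 4.0, 3: 3.0}) + sym_noise(0.01)

U_shared, V_specific = two_step_embedding(
    X, K, k_knowledge=3, r_specific=1, tau=1.0
)
# Concatenate shared and cohort-specific directions into concept embeddings.
embedding = np.hstack([U_shared, V_specific])
```

In this toy setup the screening step is meant to drop the knowledge-only direction while the complement step picks up the cohort-only one. How the ranks and threshold would be chosen in practice, and how patient embeddings are derived from concept embeddings, is not stated in the abstract and is left open here.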