🤖 AI Summary
Traditional canonical correlation analysis (CCA) fails to capture inherent structured dependencies among cross-group variables, leading to biased association estimates. To address this, we propose graph-structured CCA (gCCA), the first CCA framework that explicitly incorporates variable topology—encoded as a prior graph—via graph-regularized constraints while preserving mult-omics association modeling capability. Theoretically, we establish finite-sample concentration inequalities and stopping-time convergence guarantees using martingale theory. Algorithmically, gCCA enables interpretable identification of both positive and negative regulatory pathways. Experiments demonstrate that gCCA significantly outperforms state-of-the-art CCA methods on synthetic data and successfully disentangles DNA methylation–RNA-seq associations, revealing bidirectional regulatory mechanisms whereby methylation modulates gene expression pathways.
📝 Abstract
Canonical correlation analysis (CCA) is a widely used technique for estimating associations between two sets of multi-dimensional variables. Recent advancements in CCA methods have expanded their application to decipher the interactions of multiomics datasets, imaging-omics datasets, and more. However, conventional CCA methods are limited in their ability to incorporate structured patterns in the cross-correlation matrix, potentially leading to suboptimal estimations. To address this limitation, we propose the graph Canonical Correlation Analysis (gCCA) approach, which calculates canonical correlations based on the graph structure of the cross-correlation matrix between the two sets of variables. We develop computationally efficient algorithms for gCCA, and provide theoretical results for finite sample analysis of best subset selection and canonical correlation estimation by introducing concentration inequalities and stopping time rule based on martingale theories. Extensive simulations demonstrate that gCCA outperforms competing CCA methods. Additionally, we apply gCCA to a multiomics dataset of DNA methylation and RNA-seq transcriptomics, identifying both positively and negatively regulated gene expression pathways by DNA methylation pathways.