🤖 AI Summary
This paper investigates when a signal shared between two high-dimensional variable sets can be detected in undersampled regimes. Leveraging random matrix theory, the authors systematically analyze the signal-resolving capabilities, under additive noise, of three covariance-based estimators: the self covariance of each variable set, their cross covariance, and the covariance of the concatenated (joint) variable. They establish that both the cross-covariance and joint-covariance estimators cross the Baik–Ben Arous–Péché (BBP) detectability threshold at weaker signal strength than the self covariances, enabling earlier reliable detection of shared signals; which of the two is preferable depends critically on the mismatch in dimensionality between the two variable sets. This work provides the first unified characterization of the statistical gain from analyzing the two variable sets jointly in low signal-to-noise, small-sample settings, yielding theoretical criteria for high-dimensional association inference and principled guidance for estimator selection.
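For reference, the BBP transition in the textbook spiked-covariance (spiked Wishart) model takes the form below; this is a standard illustration of the threshold the summary refers to, not necessarily the exact setting analyzed in the paper.

```latex
% Spiked Wishart model: i.i.d. samples x_i = \sqrt{\lambda}\, s_i\, u + \xi_i \in \mathbb{R}^p,
% with unit spike direction u, so the population covariance is \Sigma = I_p + \lambda\, u u^\top.
\[
  \gamma = \frac{p}{n}, \qquad
  \lambda > \sqrt{\gamma} \;\Longrightarrow\;
  \lambda_{\max}\!\big(\hat\Sigma\big) \longrightarrow
  (1+\lambda)\Big(1 + \frac{\gamma}{\lambda}\Big) \;>\; \big(1+\sqrt{\gamma}\big)^{2},
\]
% while for \lambda \le \sqrt{\gamma} the top sample eigenvalue sticks to the
% Marchenko--Pastur bulk edge (1+\sqrt{\gamma})^2 and the spike is undetectable.
```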
📝 Abstract
Many data-science applications involve detecting a shared signal between two high-dimensional variables. Using random matrix theory methods, we determine when such a signal can be detected and reconstructed from sample correlations, despite the background of correlations induced by sampling noise. We consider three different covariance matrices constructed from two high-dimensional variables: their individual self covariances, their cross covariance, and the self covariance of the concatenated (joint) variable, which incorporates both the self and the cross covariance blocks. We observe the expected Baik, Ben Arous, and Péché detectability phase transition in all these covariance matrices, and we show that the joint and cross covariance matrices always reconstruct the shared signal earlier than the self covariances. Whether the joint or the cross approach is better depends on the mismatch between the dimensionalities of the two variables. We discuss what these observations mean for choosing the right method for detecting linear correlations in data and how these findings may generalize to nonlinear statistical dependencies.
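A minimal synthetic sketch of the three estimators, assuming a toy model in which one latent signal enters both variables on top of i.i.d. Gaussian noise; all names and parameter values (`p1`, `p2`, `n`, `lam`) are demo choices, not the paper's code or settings.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, n = 200, 200, 1000   # dimensions of the two variables, sample count
lam = 0.4                    # shared-signal strength (chosen between the joint and self thresholds here)

# One latent signal s enters both variables, plus i.i.d. Gaussian noise.
s = rng.standard_normal(n)
u = rng.standard_normal(p1); u /= np.linalg.norm(u)
v = rng.standard_normal(p2); v /= np.linalg.norm(v)
X = np.sqrt(lam) * np.outer(u, s) + rng.standard_normal((p1, n))
Y = np.sqrt(lam) * np.outer(v, s) + rng.standard_normal((p2, n))

C_xx = X @ X.T / n                   # self covariance of X
C_xy = X @ Y.T / n                   # cross covariance of X and Y
Z = np.vstack([X, Y])
C_zz = Z @ Z.T / n                   # self covariance of the joint variable

# Compare leading eigenvalues with the Marchenko-Pastur bulk edge (1 + sqrt(p/n))^2;
# an eigenvalue clearing the edge signals a detectable (BBP-supercritical) spike.
# Near the transition the separation is small and fluctuates with the seed.
mp_edge = lambda p: (1 + np.sqrt(p / n)) ** 2
print("self :", np.linalg.eigvalsh(C_xx)[-1], " edge:", mp_edge(p1))
print("joint:", np.linalg.eigvalsh(C_zz)[-1], " edge:", mp_edge(p1 + p2))

# The cross block has no single MP edge; estimate a noise baseline empirically
# by permuting the samples of Y to destroy the X-Y pairing.
null_top = np.linalg.svd(X @ Y[:, rng.permutation(n)].T / n, compute_uv=False)[0]
print("cross:", np.linalg.svd(C_xy, compute_uv=False)[0], " permuted null:", null_top)
```

In this toy model the concatenated variable carries a population spike of 1 + 2λ, so its BBP threshold, λ > √((p1+p2)/n)/2, is crossed at weaker λ than the self-covariance threshold λ > √(p1/n) whenever p2 < 3·p1, which is one way to see the abstract's claims that the joint estimator detects the shared signal earlier and that the dimensional mismatch decides which approach wins.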