Scale-Calibrated Median-of-Means for Robust Distributed Principal Component Analysis

📅 2026-05-19

🤖 AI Summary
This work addresses robust aggregation in distributed principal component analysis, where each node produces both a mean vector and a principal subspace estimate. Aggregation is complicated by heterogeneity across nodes and by the mismatch in scale between mean estimation error and subspace estimation error. To resolve this, the authors propose a scale-calibrated median-of-means estimator grounded in the product geometry of Euclidean space and the Grassmann manifold, introducing, for the first time, an adaptive scale calibration mechanism driven by the eigengap to harmonize the relative scales of the two error types. Theoretically, they establish the asymptotic equivalence of the estimator to a scaled spatial median, derive non-Gaussian limits when the number of nodes is fixed and Gaussian limits as the number of nodes grows, and support these results with an influence-function analysis and high-probability error bounds. Experiments on synthetic data and large-scale single-cell RNA-seq datasets show that scale calibration adapts to eigengap-driven subspace uncertainty and yields a robust distributed PCA summary.
📝 Abstract
Distributed principal component analysis (PCA) produces node-level estimates of both a mean vector and a principal subspace. Robustly aggregating these heterogeneous objects requires a relative scale between mean error and subspace error. We study a scale-calibrated median-of-means estimator for this problem using the product geometry of Euclidean space and the Grassmann manifold. A node-level PCA expansion shows that the mean component has the usual linear influence, whereas the subspace component is an eigengap-weighted covariance perturbation. We prove a local reduction showing that the proposed product-manifold median-of-means estimator is asymptotically equivalent to a scaled spatial median of node influence errors. This yields fixed-node non-Gaussian limits, growing-node Gaussian limits with finite-block bias, and an explicit scale-dependent covariance formula. We propose robust block-scale and inference-optimal calibration rules, establish high-probability median-of-means bounds, characterize factorwise bad-node influence, and prove node-bootstrap validity. Simulations and large-scale single-cell RNA-seq data show that scale calibration adapts to eigengap-driven subspace uncertainty and provides a robust distributed PCA summary.
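The aggregation idea described above can be illustrated with a minimal sketch. It is not the authors' estimator: it swaps in a simple medoid (the node estimate with the smallest total distance to all others) as a stand-in for a product-manifold median, uses the projection (chordal) distance between subspaces, and treats the relative weight `scale` as a fixed hypothetical constant rather than one of the paper's calibration rules. The names `node_pca`, `product_distance`, and `product_medoid` are introduced here for illustration only.

```python
import numpy as np

def node_pca(X, k):
    """Sample mean and top-k principal subspace basis for one node's data."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    _, vecs = np.linalg.eigh(cov)      # eigenvalues come back in ascending order
    U = vecs[:, ::-1][:, :k]           # columns for the k largest eigenvalues
    return mu, U

def product_distance(mu_a, U_a, mu_b, U_b, scale):
    """Distance on (Euclidean space) x (Grassmann manifold).

    Assumption: the subspace part uses the projection (chordal) distance,
    and `scale` is a hypothetical fixed weight on the subspace component.
    """
    mean_part = np.linalg.norm(mu_a - mu_b)
    sub_part = np.linalg.norm(U_a @ U_a.T - U_b @ U_b.T) / np.sqrt(2.0)
    return np.sqrt(mean_part**2 + (scale * sub_part) ** 2)

def product_medoid(estimates, scale):
    """Index of the node estimate minimizing the summed product distance."""
    n = len(estimates)
    costs = np.zeros(n)
    for i in range(n):
        for j in range(n):
            costs[i] += product_distance(*estimates[i], *estimates[j], scale)
    return int(np.argmin(costs))

# Synthetic demo: 10 nodes, the first 2 corrupted by a large mean shift.
rng = np.random.default_rng(0)
d, k, n, m = 5, 1, 200, 10
stds = np.sqrt(np.array([5.0, 1.0, 1.0, 1.0, 1.0]))  # leading variance 5, rest 1
estimates = []
for node in range(m):
    X = rng.normal(size=(n, d)) * stds
    if node < 2:
        X = X + 50.0                   # corrupted node
    estimates.append(node_pca(X, k))

idx = product_medoid(estimates, scale=1.0)
mu_hat, U_hat = estimates[idx]
print("selected node:", idx)
```

In this toy setup the selected node is one of the uncorrupted ones, because the two shifted nodes are far from the majority under the product distance. Changing `scale` changes how much a subspace disagreement counts relative to a mean disagreement, which is the quantity the abstract says must be calibrated.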
Problem

Research questions and friction points this paper is trying to address.

distributed PCA
robust aggregation
scale calibration
median-of-means
Grassmann manifold
Innovation

Methods, ideas, or system contributions that make the work stand out.

scale-calibrated median-of-means
distributed PCA
product manifold
eigengap-weighted perturbation
robust aggregation