🤖 AI Summary
The sample covariance matrix is highly sensitive to both casewise outliers (cases from a different population) and cellwise outliers (deviating entries of the data matrix). Existing robust covariance estimators that handle both types of outliers are computationally feasible only up to about 20 dimensions. This paper proposes *cellRCov*, a robust covariance estimator that simultaneously handles casewise outliers, cellwise outliers, and missing data in higher dimensions. It decomposes the covariance on principal and orthogonal subspaces, building on recent work on robust PCA, and applies a ridge-type regularization to stabilize the estimate. The authors derive the casewise and cellwise influence functions of *cellRCov* and establish its consistency and asymptotic normality. A simulation study shows that *cellRCov* outperforms competing methods on contaminated data and on data with missing values, and a real-world anomaly detection application illustrates its practical use. The paper also constructs *cellRCCA*, a robust and regularized method for canonical correlation analysis.
📝 Abstract
The sample covariance matrix is a cornerstone of multivariate statistics, but it is highly sensitive to outliers. These can be casewise outliers, such as cases belonging to a different population, or cellwise outliers, which are deviating cells (entries) of the data matrix. Recently some robust covariance estimators have been developed that can handle both types of outliers, but their computation is only feasible up to at most 20 dimensions. To remedy this we propose the cellRCov method, a robust covariance estimator that simultaneously handles casewise outliers, cellwise outliers, and missing data. It relies on a decomposition of the covariance on principal and orthogonal subspaces, leveraging recent work on robust PCA. It also employs a ridge-type regularization to stabilize the estimated covariance matrix. We establish some theoretical properties of cellRCov, including its casewise and cellwise influence functions as well as consistency and asymptotic normality. A simulation study demonstrates the superior performance of cellRCov in contaminated and missing data scenarios. Furthermore, its practical utility is illustrated in a real-world application to anomaly detection. We also construct and illustrate the cellRCCA method for robust and regularized canonical correlation analysis.
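The decomposition described in the abstract can be illustrated with a small, deliberately non-robust sketch. It is not the cellRCov algorithm: classical PCA (via SVD) stands in for the robust PCA step, there is no outlier or missing-data handling, and the ridge step is assumed here to be a convex combination with a scaled identity matrix, since the abstract does not specify its form. The function name `subspace_ridge_covariance` and the parameters `k` and `ridge` are hypothetical.

```python
import numpy as np


def subspace_ridge_covariance(X, k, ridge=0.1):
    """Illustrative covariance estimate: principal part + orthogonal part + ridge.

    Assumptions (not taken from the paper): classical PCA replaces robust PCA,
    and the ridge step shrinks toward (trace / p) * I with weight `ridge`.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)  # center the columns

    # Loadings of the k-dimensional principal subspace (p x k).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k].T

    scores = Xc @ V               # coordinates inside the principal subspace
    resid = Xc - scores @ V.T     # remainder, lying in the orthogonal subspace

    # Covariance contributed by each subspace.
    cov_principal = V @ (scores.T @ scores / (n - 1)) @ V.T
    cov_orthogonal = resid.T @ resid / (n - 1)
    cov = cov_principal + cov_orthogonal

    # Ridge-type shrinkage keeps the estimate positive definite even when p > n.
    target = (np.trace(cov) / p) * np.eye(p)
    return (1.0 - ridge) * cov + ridge * target


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 50))  # more variables than cases
    S = subspace_ridge_covariance(X, k=3, ridge=0.1)
    print(S.shape, np.linalg.eigvalsh(S).min() > 0)
```

With classical PCA and `ridge=0`, the two parts add up exactly to the ordinary sample covariance, so the sketch only shows the structure of the estimate. Any robustness would have to come from replacing the PCA fit and the two covariance pieces with robust counterparts, as the paper proposes.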