🤖 AI Summary
To address privacy leakage and statistical distortion arising from entire-feature-block missingness in vertical federated learning (VFL) under MCAR/MAR mechanisms, this paper proposes the first Gaussian copula-based privacy-preserving distributed data-sharing framework. We introduce vertical distributed attribute differential privacy (VDADP) and design three algorithms (VCDS, EVCDS, and the iterative variant IEVCDS) that support heterogeneous data types and task-agnostic collaborative modeling. Our approach employs debiased randomized response to privately estimate correlation matrices and combines it with nonparametric marginal distribution estimation to securely infer copula parameters under missingness. We theoretically establish consistency of generalized linear model (GLM) coefficient estimation and variable selection. Experiments on synthetic and real-world datasets demonstrate significant improvements over baselines, achieving high statistical utility even under strong privacy guarantees (ε ≤ 2).
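The "debiased randomized response" step can be illustrated with a minimal sketch: perturb pairwise concordance indicators with a standard ε-randomized-response flip, invert the known mixing to debias the concordance probability, and map the resulting Kendall's tau to a Gaussian-copula correlation. The function names and the use of Kendall's tau here are assumptions for illustration; the paper's exact mechanism may differ.

```python
import numpy as np

def debiased_rr_kendall(x, y, eps, rng):
    """Estimate Kendall's tau from concordance indicators perturbed by
    randomized response, then debias. Illustrative sketch only."""
    p = 1.0 / (1.0 + np.exp(eps))           # flip probability for eps-LDP
    i, j = np.triu_indices(len(x), k=1)     # all index pairs
    conc = ((x[i] - x[j]) * (y[i] - y[j]) > 0).astype(float)
    noisy = np.where(rng.random(conc.shape) < p, 1.0 - conc, conc)
    # E[noisy] = (1 - p) * conc + p * (1 - conc); invert the mixing:
    p_conc = (noisy.mean() - p) / (1.0 - 2.0 * p)
    return float(np.clip(2.0 * p_conc - 1.0, -1.0, 1.0))

def tau_to_rho(tau):
    """Map Kendall's tau to the Gaussian-copula correlation parameter."""
    return np.sin(np.pi * tau / 2.0)
```

With a large privacy budget the flip probability is small and the debiased estimate concentrates near the true tau; as ε shrinks, the `1/(1 - 2p)` factor inflates the noise, which is the usual privacy-utility trade-off.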
📝 Abstract
Vertical Federated Learning (VFL) often suffers from client-wise missingness, where entire feature blocks from some clients are unobserved, and conventional approaches are vulnerable to privacy leakage. We propose a Gaussian copula-based framework for VFL data privatization under missingness constraints, which requires no prior specification of downstream analysis tasks and imposes no restriction on the number of analyses. To privately estimate copula parameters, we introduce a debiased randomized response mechanism for correlation matrix estimation from perturbed ranks, together with a nonparametric privatized marginal estimation that yields consistent CDFs even under MAR. The proposed methods comprise VCDS for MCAR data, EVCDS for MAR data, and IEVCDS, which iteratively refines copula parameters to mitigate MAR-induced bias. Notably, EVCDS and IEVCDS also apply under MCAR, and the framework accommodates mixed data types, including discrete variables. Theoretically, we introduce the notion of Vertical Distributed Attribute Differential Privacy (VDADP), tailored to the VFL setting, establish corresponding privacy and utility guarantees, and investigate the utility of the privatized data for generalized linear model (GLM) coefficient estimation and variable selection, establishing asymptotic estimation and variable-selection consistency for VFL-GLMs. Extensive simulations and a real-data application demonstrate the effectiveness of the proposed framework.
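The generation side of such a framework can be sketched as follows: draw latent Gaussians with the (privately estimated) correlation matrix, push them through the standard normal CDF to get uniforms, and invert each column's empirical CDF to recover the original marginal shape. This is a hypothetical, non-private sketch under simple assumptions; the paper's VCDS/EVCDS algorithms additionally inject privacy noise and handle missing feature blocks, which are omitted here.

```python
import numpy as np
from scipy.stats import norm

def copula_synthesize(data, corr, n_new, rng):
    """Draw synthetic rows from a Gaussian copula with correlation `corr`
    and empirical (nonparametric) marginals taken from `data`.
    Illustrative sketch, not the paper's exact algorithm."""
    L = np.linalg.cholesky(corr)                  # corr must be positive definite
    z = rng.standard_normal((n_new, corr.shape[0])) @ L.T
    u = norm.cdf(z)                               # latent uniform scores
    synth = np.empty_like(u)
    for j in range(data.shape[1]):
        # invert the empirical CDF of column j via its sample quantiles
        synth[:, j] = np.quantile(data[:, j], u[:, j])
    return synth
```

Because the marginals are inverted nonparametrically, each synthetic column inherits the observed column's distribution (including skew or discreteness up to quantile resolution), while the copula controls the cross-column dependence.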