Data Privatization in Vertical Federated Learning with Client-wise Missing Problem

📅 2025-11-25
🤖 AI Summary
To address privacy leakage and statistical distortion arising from entire feature-block missingness in vertical federated learning (VFL) under MCAR/MAR mechanisms, this paper proposes a Gaussian copula-based privacy-preserving distributed data sharing framework. It introduces Vertical Distributed Attribute Differential Privacy (VDADP) and designs three algorithms: VCDS for MCAR data, EVCDS for MAR data, and IEVCDS, an iterative variant of EVCDS. All three support mixed data types and task-agnostic collaborative modeling. The approach uses a debiased randomized response mechanism to estimate private correlation matrices and combines it with nonparametric privatized marginal distribution estimation to infer copula parameters under missingness. The paper establishes consistency of generalized linear model (GLM) coefficient estimation and variable selection. Experiments on synthetic and real-world datasets demonstrate significant improvements over baselines, achieving high statistical utility even under strong privacy guarantees (ε ≤ 2).

📝 Abstract
Vertical Federated Learning (VFL) often suffers from client-wise missingness, where entire feature blocks from some clients are unobserved, and conventional approaches are vulnerable to privacy leakage. We propose a Gaussian copula-based framework for VFL data privatization under missingness constraints, which requires no prior specification of downstream analysis tasks and imposes no restriction on the number of analyses. To privately estimate copula parameters, we introduce a debiased randomized response mechanism for correlation matrix estimation from perturbed ranks, together with a nonparametric privatized marginal estimation that yields consistent CDFs even under MAR. The proposed methods comprise VCDS for MCAR data, EVCDS for MAR data, and IEVCDS, which iteratively refines copula parameters to mitigate MAR-induced bias. Notably, EVCDS and IEVCDS also apply under MCAR, and the framework accommodates mixed data types, including discrete variables. Theoretically, we introduce the notion of Vertical Distributed Attribute Differential Privacy (VDADP), tailored to the VFL setting, establish corresponding privacy and utility guarantees, and investigate the utility of privatized data for GLM coefficient estimation and variable selection. We further establish asymptotic properties including estimation and variable selection consistency for VFL-GLMs. Extensive simulations and a real-data application demonstrate the effectiveness of the proposed framework.
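To make the Gaussian copula idea concrete, here is a minimal, non-private sketch of the general technique: estimate a latent correlation from ranks, sample latent Gaussians, and map them back through empirical marginal quantiles. It is not the paper's VCDS/EVCDS/IEVCDS procedure; it omits privatization and missing-data handling entirely. The function names (`fit_gaussian_copula`, `sample_gaussian_copula`) and the toy data are assumptions made for illustration.

```python
import math
import numpy as np

def ranks(col):
    # Ordinal ranks 0..n-1; adequate for continuous columns without ties.
    return np.argsort(np.argsort(col))

def fit_gaussian_copula(X):
    # Spearman rank correlation, converted to the latent Gaussian
    # correlation via the standard relation rho = 2 * sin(pi * rho_s / 6).
    R = np.column_stack([ranks(X[:, j]) for j in range(X.shape[1])])
    rho_s = np.corrcoef(R, rowvar=False)
    return 2.0 * np.sin(np.pi * rho_s / 6.0)

def sample_gaussian_copula(X, corr, n_samples, rng):
    # Draw latent Gaussians, push them through the normal CDF to get
    # uniforms, then invert each empirical marginal with np.quantile.
    d = X.shape[1]
    Z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    U = 0.5 * (1.0 + np.vectorize(math.erf)(Z / math.sqrt(2.0)))
    return np.column_stack([np.quantile(X[:, j], U[:, j]) for j in range(d)])

# Toy data (hypothetical): two non-Gaussian columns driven by latent
# Gaussians with correlation 0.6.
rng = np.random.default_rng(1)
latent = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=5000)
X = np.column_stack([np.exp(latent[:, 0]), latent[:, 1] ** 3])

corr = fit_gaussian_copula(X)
synthetic = sample_gaussian_copula(X, corr, 5000, rng)
```

Because the correlation is estimated from ranks, it is unaffected by the monotone transformations applied to each column, which is what lets a copula model separate the dependence structure from the marginals.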
Problem

Research questions and friction points this paper is trying to address.

Addresses client-wise data missingness in Vertical Federated Learning
Proposes privacy-preserving data privatization methods under missingness constraints
Ensures privacy protection while maintaining statistical utility guarantees
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian copula framework for VFL data privatization
Debiased randomized response for correlation estimation
Nonparametric marginal estimation under missingness constraints
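The debiasing idea behind randomized response can be illustrated with the classic binary mechanism. The sketch below is a generic example, not the paper's rank-based correlation estimator: each bit is kept with probability e^ε / (1 + e^ε) and flipped otherwise, and the known flip probability is then inverted to recover an unbiased estimate of the true proportion. The function names and the simulated data are assumptions for illustration.

```python
import numpy as np

def randomized_response(bits, epsilon, rng):
    # Keep each bit with probability e^eps / (1 + e^eps); flip it otherwise.
    p_keep = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    keep = rng.random(bits.shape) < p_keep
    return np.where(keep, bits, 1 - bits)

def debiased_mean(reports, epsilon):
    # E[report] = p_keep * pi + (1 - p_keep) * (1 - pi); solve for pi.
    p_keep = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return (reports.mean() - (1.0 - p_keep)) / (2.0 * p_keep - 1.0)

# Simulated data (hypothetical): true proportion of ones is about 0.3.
rng = np.random.default_rng(0)
true_bits = (rng.random(200_000) < 0.3).astype(int)
reports = randomized_response(true_bits, epsilon=1.0, rng=rng)

naive = reports.mean()                            # biased toward 0.5
estimate = debiased_mean(reports, epsilon=1.0)    # close to the true proportion
```

The naive mean of the perturbed reports is pulled toward 0.5, while the debiased estimate recovers the true proportion at the cost of higher variance, which grows as ε shrinks.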
🔎 Similar Papers
Huiyun Tang
University of Luxembourg
Human-computer interaction, misinformation
Long Feng
Professor of Nankai University
High Dimensional Data, High Frequency Data
Yang Li
Center for Applied Statistics, Renmin University of China, Beijing, China; School of Statistics, Renmin University of China, Beijing, China
Feifei Wang
Center for Applied Statistics, Renmin University of China, Beijing, China; School of Statistics, Renmin University of China, Beijing, China