🤖 AI Summary
To address missing attributes in microsamples, which arise from privacy constraints or data collection limitations, this paper proposes a Wasserstein generative adversarial network (WGAN)-based population synthesis method that explicitly encodes missingness patterns via mask matrices. Integrated into the WGAN training framework, these masks let the model learn the joint distribution directly from incomplete microdata without imputation or preprocessing, removing the requirement of conventional synthesis methods for complete observations. Experiments on a Swedish national travel survey indicate that the synthesized populations closely match the ground-truth data in both marginal and joint distributions. Performance is comparable to that of baseline models trained on fully observed data, which supports the method’s robustness and practicality under realistic data incompleteness. The approach broadens the applicability of population synthesis to privacy-sensitive or data-scarce settings.
📝 Abstract
This paper presents a population synthesis model that uses a Wasserstein generative adversarial network (WGAN) trained on incomplete microsamples. Using a mask matrix to represent missing values, the study proposes a WGAN training algorithm that lets the model learn from a training dataset with missing information. The method addresses the problem of microsamples that lack information on one or more attributes because of privacy concerns or data collection constraints. We compare WGAN models trained on incomplete microsamples with models trained on complete microsamples, and each model is used to generate a synthetic population. We evaluate the method in a series of experiments on a Swedish national travel survey, validating it by comparing the synthetic populations from all models with the actual population dataset. The results show that the proposed method generates synthetic data that closely resembles both the data generated by a model trained on complete data and the actual population. The paper contributes a robust solution for population synthesis with incomplete data, opens avenues for future research, and highlights the potential of deep generative models for population synthesis.
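The mask-matrix idea can be illustrated with a small sketch. The abstract does not spell out the training algorithm, so the code below only shows one plausible reading: a 0/1 mask marks which entries of each record are observed, and the same mask is applied to real and generated rows so that a critic would compare observed entries only. The function names (`build_mask`, `apply_mask`), the attribute columns, and all values are hypothetical and are not taken from the paper.

```python
from typing import List, Optional

Record = List[Optional[float]]


def build_mask(records: List[Record]) -> List[List[int]]:
    """Return a 0/1 matrix: 1 where a value is observed, 0 where it is missing."""
    return [[0 if value is None else 1 for value in row] for row in records]


def apply_mask(rows: List[List[float]], mask: List[List[int]]) -> List[List[float]]:
    """Zero out entries flagged as missing (element-wise product with the mask)."""
    return [
        [value * flag for value, flag in zip(row, mask_row)]
        for row, mask_row in zip(rows, mask)
    ]


# Toy microsample with made-up columns (age, income, trips per day).
# None marks a missing attribute value.
real = [
    [34.0, 52000.0, 3.0],
    [51.0, None, 2.0],
    [None, 41000.0, None],
]
mask = build_mask(real)

# Replace missing real entries with 0.0 so every row is numeric.
real_filled = [[0.0 if value is None else value for value in row] for row in real]

# Stand-in for generator output with the same shape as the real batch.
fake = [
    [30.0, 48000.0, 4.0],
    [55.0, 60000.0, 1.0],
    [42.0, 39000.0, 2.0],
]

# Assumption: the critic receives both batches masked in the same way,
# so unobserved attributes contribute nothing to the comparison.
masked_real = apply_mask(real_filled, mask)
masked_fake = apply_mask(fake, mask)
```

In this sketch the second generated row keeps its age and trip count but has its income zeroed, mirroring the missing income in the second real record. No values are imputed at any point, which is consistent with the claim that the model learns from incomplete microsamples directly.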