Convex space learning for tabular synthetic data generation

๐Ÿ“… 2024-07-13
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 2
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This study addresses the fundamental challenge of balancing privacy preservation and data utility in clinical data sharing. We propose NextConvGeNโ€”the first deep generative model to extend convex-space learning to full-table synthetic data generation. NextConvGeN introduces a neighborhood-driven paradigm, leveraging a GAN framework to model local neighborhood convex hulls while jointly enforcing geometric constraints and adversarial training, thereby enhancing fidelity without compromising privacy. Extensive evaluation across ten biomedical datasets demonstrates that NextConvGeN consistently outperforms five state-of-the-art methods: it achieves an average 4.2% improvement in classification F1-score and a 5.8% gain in clustering Adjusted Rand Index (ARI). All utility metrics rank among the best reported, confirming its effectiveness for downstream clinical research, machine learning model development, and clinical decision support.

Technology Category

Application Category

๐Ÿ“ Abstract
Generating synthetic samples from the convex space of the minority class is a popular oversampling approach for imbalanced classification problems. Recently, deep-learning approaches have been successfully applied to modeling the convex space of minority samples. Beyond oversampling, learning the convex space of neighborhoods in training data has not been used to generate entire tabular datasets. In this paper, we introduce a deep learning architecture (NextConvGeN) with a generator and discriminator component that can generate synthetic samples by learning to model the convex space of tabular data. The generator takes data neighborhoods as input and creates synthetic samples within the convex space of that neighborhood. Thereafter, the discriminator tries to classify these synthetic samples against a randomly sampled batch of data from the rest of the data space. We compared our proposed model with five state-of-the-art tabular generative models across ten publicly available datasets from the biomedical domain. Our analysis reveals that synthetic samples generated by NextConvGeN can better preserve classification and clustering performance across real and synthetic data than other synthetic data generation models. Synthetic data generation by deep learning of the convex space produces high scores for popular utility measures. We further compared how diverse synthetic data generation strategies perform in the privacy-utility spectrum and produced critical arguments on the necessity of high utility models. Our research on deep learning of the convex space of tabular data opens up opportunities in clinical research, machine learning model development, decision support systems, and clinical data sharing.
Problem

Research questions and friction points this paper is trying to address.

Generating synthetic tabular data via convex space learning.
Improving classification and clustering with synthetic samples.
Evaluating privacy and utility in synthetic data strategies.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep learning for convex space
Generator-discriminator architecture
Synthetic tabular data generation
๐Ÿ”Ž Similar Papers
No similar papers found.
M
Manjunath Mahendra
Institute of Computer Science, University of Rostock, Germany
C
Chaithra Umesh
Institute of Computer Science, University of Rostock, Germany
S
Saptarshi Bej
Indian Institute of Science Education and Research Thiruvananthapuram, India
K
Kristian Schultz
Institute of Computer Science, University of Rostock, Germany
Olaf Wolkenhauer
Olaf Wolkenhauer
Professor for Systems Biology and Bioinformatics
Systems TheoryData Science