AI Summary
Synthetic tabular data often suffer from exacerbated class imbalance and structural mismatch with supervised learning objectives, degrading downstream model performance. To address this, we propose PRRO, a novel framework that jointly optimizes data pruning and column reordering. Pruning selects high signal-to-noise-ratio samples to mitigate class imbalance, while reordering restructures feature input sequences to align the generator's architecture with supervised learning requirements. Evaluated on 22 public benchmarks, PRRO achieves an average 26.74% improvement in predictive performance (up to 871.46%) and enhances class distribution similarity by 43% on highly imbalanced datasets. The method significantly improves the utility and generalizability of synthetic data, establishing a new paradigm for high-fidelity synthetic tabular data generation.
Abstract
Tabular data synthesis for supervised learning (SL) model training is gaining popularity in industries such as healthcare, finance, and retail. Despite the progress made in tabular data generators, models trained with synthetic data often underperform those trained with original data. This low SL utility of synthetic data stems from two problems: generators exaggerate class imbalance, and they overlook the data relationships that SL depends on. To address these challenges, we draw inspiration from techniques in emerging data-centric artificial intelligence and introduce Pruning and ReOrdering (PRRO), a novel pipeline that integrates data-centric techniques into tabular data synthesis. PRRO incorporates data pruning to guide the table generator towards observations with a high signal-to-noise ratio, ensuring that the class distribution of the synthetic data closely matches that of the original data. In addition, PRRO employs a column reordering algorithm to align the data modeling structure of generators with that of SL models. Together, these two modules enable PRRO to optimize the SL utility of synthetic data. Empirical experiments on 22 public datasets show that synthetic data generated using PRRO enhances predictive performance compared to data generated without PRRO. Specifically, when synthetic data replaces the original data, PRRO yields an average improvement of 26.74% and up to 871.46%; when synthetic data is appended to the original data, PRRO yields an average improvement of 6.13% and up to 200.32%. Furthermore, experiments on six highly imbalanced datasets show that PRRO enables the generator to produce synthetic data whose class distribution resembles that of the original data more closely, achieving a similarity improvement of 43%. Through PRRO, we foster a seamless integration of data synthesis with subsequent SL prediction, promoting high-quality and accessible data analysis.
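To make the two quantities discussed above more concrete, here is a minimal sketch in Python. It is not the PRRO implementation: the abstract does not say how class distribution similarity is measured or how columns are reordered. The sketch assumes similarity is scored as one minus the total variation distance between class frequencies, and assumes the simplest possible reordering, placing the target column last so that a left-to-right generator would model the label conditioned on the features. The function names and column names are hypothetical.

```python
from collections import Counter


def class_distribution(labels):
    """Return the relative frequency of each class label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def distribution_similarity(original_labels, synthetic_labels):
    """Score in [0, 1]; 1.0 means identical class frequencies.

    Assumption: similarity = 1 - total variation distance. The abstract
    does not state which measure the authors use.
    """
    p = class_distribution(original_labels)
    q = class_distribution(synthetic_labels)
    classes = set(p) | set(q)
    tvd = 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)
    return 1.0 - tvd


def move_target_last(columns, target):
    """Hypothetical reordering: keep feature order, put the target last."""
    if target not in columns:
        raise ValueError(f"unknown target column: {target}")
    return [c for c in columns if c != target] + [target]


# Original data: 10% minority class. Synthetic data: minority shrunk to 1%,
# which is the kind of imbalance exaggeration the abstract describes.
original = [0] * 90 + [1] * 10
synthetic = [0] * 99 + [1] * 1
similarity = distribution_similarity(original, synthetic)
ordered = move_target_last(["age", "label", "income"], "label")
```

In this toy case the synthetic labels under-represent the minority class, so the score falls below 1.0; a generator that preserved the 90/10 split would score exactly 1.0.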