Transforming Datasets to Requested Complexity with Projection-based Many-Objective Genetic Algorithm

📅 2025-07-20
🤖 AI Summary
To address the lack of benchmark datasets with controllable complexity for machine learning evaluation, this paper proposes a data transformation framework that combines linear feature projections with a many-objective genetic algorithm. Starting from synthetically created datasets, the method optimizes ten classification and four regression complexity measures towards requested target values, giving targeted control over dataset difficulty. The same procedure covers both classification and regression tasks. Experiments show that the generated datasets span varying levels of difficulty, and that model performance (accuracy for classification and RMSE for regression) correlates with the controlled complexity measures. Such datasets make it easier to attribute differences in model performance to problem difficulty. The source code and generated datasets are publicly available.

📝 Abstract
The research community continues to seek increasingly advanced synthetic data generators to reliably evaluate the strengths and limitations of machine learning methods. This work aims to increase the availability of datasets encompassing a diverse range of problem complexities by proposing a genetic algorithm that optimizes a set of problem complexity measures for classification and regression tasks towards specific targets. For classification, a set of 10 complexity measures was used, while for regression, 4 measures demonstrating promising optimization capabilities were selected. Experiments confirmed that the proposed genetic algorithm can generate datasets with varying levels of difficulty by transforming synthetically created datasets through linear feature projections until they reach the target complexity values. Evaluations involving state-of-the-art classifiers and regressors revealed a correlation between the complexity of the generated data and the recognition quality.
Problem

Research questions and friction points this paper is trying to address.

Generate datasets with target complexity for ML evaluation
Optimize complexity measures for classification and regression tasks
Create diverse problem difficulties via linear feature projections
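
To make the idea of "complexity under a linear projection" concrete, here is a minimal sketch in Python. It assumes a Fisher-ratio-based measure in the common form 1 / (1 + max ratio) as a stand-in for one classification complexity measure; the function names, the toy data, and the two hand-picked projection matrices are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def f1_complexity(X, y):
    """Fisher-ratio-based complexity in (0, 1]; lower means easier to separate.

    Assumed stand-in for one classification complexity measure (binary labels).
    """
    a, b = X[y == 0], X[y == 1]
    # Per-feature discriminant ratio: squared mean gap over summed variances.
    ratio = (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0) + 1e-12)
    return 1.0 / (1.0 + ratio.max())

def make_two_class_data(n=200, d=4, gap=3.0, seed=0):
    """Toy data: Gaussian noise where the classes differ only along feature 0."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    y = np.repeat([0, 1], n // 2)
    X[y == 1, 0] += gap
    return X, y

X, y = make_two_class_data()

# Two hypothetical projection matrices (4 input features -> 2 output features).
P_keep = np.eye(4)[:, :2]    # keeps the informative feature 0
P_drop = np.eye(4)[:, 1:3]   # discards the informative feature 0

easy = f1_complexity(X @ P_keep, y)
hard = f1_complexity(X @ P_drop, y)
print(f"complexity with informative feature kept:    {easy:.3f}")
print(f"complexity with informative feature dropped: {hard:.3f}")
```

The same source data yields very different complexity values depending on the projection, which is the degree of freedom a search procedure can exploit to reach a requested target.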
Innovation

Methods, ideas, or system contributions that make the work stand out.

Genetic algorithm optimizes problem complexity measures
Linear feature projections transform synthetic datasets
Targets classification and regression complexity metrics
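
The points above can be illustrated with a deliberately simplified sketch: a small genetic algorithm that evolves a projection matrix so that one complexity measure of the projected data approaches a requested target. This is not the authors' implementation. It assumes a single Fisher-ratio-based measure instead of the paper's many-objective setting, and the selection, crossover, and mutation choices, as well as all names and parameter values, are hypothetical.

```python
import numpy as np

def f1_complexity(X, y):
    """Assumed stand-in measure: 1 / (1 + max per-feature Fisher ratio)."""
    a, b = X[y == 0], X[y == 1]
    ratio = (a.mean(0) - b.mean(0)) ** 2 / (a.var(0) + b.var(0) + 1e-12)
    return 1.0 / (1.0 + ratio.max())

def evolve_projection(X, y, target, out_dim=2, pop_size=30, generations=40, seed=0):
    """Evolve a (d x out_dim) projection whose output has complexity near `target`."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pop = rng.normal(size=(pop_size, d, out_dim))  # random initial projections

    def error(P):
        # Fitness: absolute distance between achieved and requested complexity.
        return abs(f1_complexity(X @ P, y) - target)

    for _ in range(generations):
        errs = np.array([error(P) for P in pop])
        parents = pop[np.argsort(errs)[: pop_size // 2]]  # truncation selection
        n_children = pop_size - len(parents)
        i = rng.integers(len(parents), size=n_children)
        j = rng.integers(len(parents), size=n_children)
        # Uniform crossover: each matrix entry comes from one of two parents.
        mask = rng.random((n_children, d, out_dim)) < 0.5
        children = np.where(mask, parents[i], parents[j])
        children += rng.normal(scale=0.1, size=children.shape)  # Gaussian mutation
        pop = np.concatenate([parents, children])  # parents survive unchanged

    errs = np.array([error(P) for P in pop])
    best = int(np.argmin(errs))
    return pop[best], float(errs[best])

# Toy source dataset: two Gaussian classes separated along feature 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.repeat([0, 1], 100)
X[y == 1, 0] += 2.0

target = 0.6
P_best, err = evolve_projection(X, y, target)
achieved = f1_complexity(X @ P_best, y)
print(f"target={target:.2f} achieved={achieved:.3f} error={err:.3f}")
```

A many-objective variant would replace the scalar error with a vector of per-measure distances to their targets and use a selection scheme designed for several objectives.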