🤖 AI Summary
Problem: Conventional data discovery methods integrate datasets toward a single predefined quality criterion, which can bias downstream AI models. Method: This paper proposes a skyline-based dataset generation framework that treats data discovery as multi-objective optimization, constructing Pareto-optimal (skyline) datasets by jointly optimizing user-specified model performance measures such as accuracy, robustness, and generalizability. Contribution/Results: The paper introduces MODis, formulated as a multi-goal finite-state transducer, and derives three algorithms to counter the bias induced by a single scalar measure: a "reduce-from-universal" strategy that starts from a universal schema and iteratively prunes unpromising data; a bidirectional strategy that interleaves data augmentation and reduction to lower cost; and a diversification algorithm that mitigates bias in the skyline datasets. Together these combine skyline computation, data pruning, and diversity-aware selection for dataset discovery over multiple data sources. Experiments verify the efficiency and effectiveness of the algorithms and showcase their use in optimizing data science pipelines.
📝 Abstract
Preparing high-quality datasets required by various data-driven AI and machine learning models has become a cornerstone task in data-driven analysis. Conventional data discovery methods typically integrate datasets toward a single predefined quality measure, which may introduce bias in downstream tasks. This paper introduces MODis, a framework that discovers datasets by optimizing multiple user-defined model-performance measures. Given a set of data sources and a model, MODis selects and integrates data sources into a skyline dataset, over which the model is expected to achieve the desired performance on all of the performance measures. We formulate MODis as a multi-goal finite state transducer and derive three feasible algorithms to generate skyline datasets. Our first algorithm adopts a "reduce-from-universal" strategy that starts with a universal schema and iteratively prunes unpromising data. Our second algorithm further reduces the cost with a bi-directional strategy that interleaves data augmentation and reduction. We also introduce a diversification algorithm to mitigate the bias in skyline datasets. We experimentally verify the efficiency and effectiveness of our skyline data discovery algorithms and showcase their applications in optimizing data science pipelines.
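To make the notion of a skyline dataset concrete, the following is a minimal sketch of Pareto (skyline) filtering over candidate datasets. It is not the MODis algorithm itself: the function names `dominates` and `skyline`, the candidate labels, and the performance values are all hypothetical, and the sketch assumes every measure is "higher is better".

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """Return True if `a` is at least as good as `b` on every measure
    and strictly better on at least one (assumes higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(candidates: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Keep the candidates that no other candidate dominates."""
    return [
        name
        for name, perf in candidates.items()
        if not any(
            dominates(other_perf, perf)
            for other_name, other_perf in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical candidates: each tuple holds (accuracy, robustness) of a model
# evaluated over that candidate dataset. The values are made up for illustration.
candidates = {
    "D1": (0.90, 0.60),
    "D2": (0.85, 0.80),
    "D3": (0.80, 0.70),  # dominated by D2 on both measures
    "D4": (0.70, 0.90),
}

print(skyline(candidates))  # ['D1', 'D2', 'D4']
```

In this toy example no single candidate is best on both measures, so the skyline keeps three non-dominated trade-offs, whereas ranking by one measure alone would return only one of them. The pairwise check is quadratic in the number of candidates, which is fine for illustration but is one reason pruning strategies such as those named in the abstract matter at scale.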