Data Requirement Goal Modeling for Machine Learning Systems

📅 2025-04-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the challenge non-experts face in systematically identifying data requirements for machine learning (ML) systems, this paper proposes a goal-oriented modeling approach for data requirement elicitation, introducing the customizable Data Requirement Goal Model (DRGM). DRGM combines Goal-Question-Metric (GQM)-inspired modeling in the Goal-oriented Requirement Language (GRL) with a customization mechanism informed by gray and white literature, so that the model can be configured by task type, KPIs, and goal weights. It supports structured assessment of data attribute quality and contextual suitability. Presented as a dedicated goal model for ML-specific data requirements, DRGM targets a methodological gap in involving non-experts in data requirement engineering. Validation through two illustrative examples based on real-world projects shows that DRGM-derived requirements align with the projects' actual requirements, supporting non-experts in specifying data requirements and in comparing alternative datasets.

📝 Abstract
Machine Learning (ML) has been integrated into a wide range of software and systems. Two components are essential for training an ML model: the training data and the ML algorithm. Given the critical role of data in ML system development, it has become increasingly important to assess the quality of data attributes and to ensure that the data meets specific requirements before it is used. This work proposes an approach that uses goal modeling to guide non-experts in identifying data requirements for ML systems. In this approach, we first develop the Data Requirement Goal Model (DRGM) by surveying the white literature to identify and categorize the issues and challenges faced by data scientists and requirement engineers working on ML-related projects. An initial DRGM was built to accommodate common tasks that generalize across projects. Then, based on insights from both white and gray literature, a customization mechanism is built to help adjust the tasks, the KPIs, and the importance of goals and other elements within the DRGM. The generated model helps its users evaluate different datasets using GRL evaluation strategies. We then validate the approach through two illustrative examples based on real-world projects. The results show that the data requirements identified by the proposed approach align with the requirements of the real-world projects, which supports the practicality and effectiveness of the proposed framework. The proposed DRGM and its dataset selection customization mechanism help guide non-experts in identifying the data requirements for an ML system tailored to a specific ML problem. The approach also aids in evaluating dataset alternatives to choose the most suitable dataset for the problem. For future work, we recommend implementing tool support that generates the DRGM through a chatbot interface.
Problem

Research questions and friction points this paper is trying to address.

Guiding non-experts in identifying ML data requirements using goal modeling
Evaluating dataset quality and alignment with project-specific ML needs
Customizing data requirement models for tailored ML system development
Innovation

Methods, ideas, or system contributions that make the work stand out.

Goal modeling for ML data requirements
Customizable Data Requirement Goal Model
GRL evaluation for dataset assessment
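
To make the idea of GRL-based dataset assessment more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simplified quantitative evaluation in which each KPI is mapped linearly onto a satisfaction scale of [-100, 100] and a goal's satisfaction is the weighted sum of its contributing KPIs, clipped to that range. The KPI names, thresholds, weights, and dataset values are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Kpi:
    """A measurable data attribute (hypothetical example: share of missing values)."""
    name: str
    target: float     # value that fully satisfies the goal (+100)
    threshold: float  # value treated as neutral (0)
    worst: float      # value that fully denies the goal (-100)

    def satisfaction(self, observed: float) -> float:
        """Map an observed value linearly onto [-100, 100] (assumed mapping)."""
        # Flip signs for "lower is better" KPIs so that larger always means better.
        sign = 1.0 if self.target > self.worst else -1.0
        x, t, th, w = (sign * v for v in
                       (observed, self.target, self.threshold, self.worst))
        if not (w < th < t):
            raise ValueError("threshold must lie strictly between worst and target")
        if x >= th:
            score = 100.0 * (x - th) / (t - th)
        else:
            score = -100.0 * (th - x) / (th - w)
        return max(-100.0, min(100.0, score))


def evaluate_goal(weights: dict[str, float], satisfactions: dict[str, float]) -> float:
    """Weighted-sum propagation: contribution weights are percentages (-100..100)."""
    total = sum(w * satisfactions[name] / 100.0 for name, w in weights.items())
    return max(-100.0, min(100.0, total))


# Hypothetical KPIs and goal weights for a "data is fit for training" goal.
kpis = {
    "missing_ratio": Kpi("missing_ratio", target=0.0, threshold=0.1, worst=0.3),
    "row_count": Kpi("row_count", target=10_000, threshold=5_000, worst=1_000),
}
weights = {"missing_ratio": 60.0, "row_count": 40.0}

# Hypothetical measurements for two candidate datasets.
datasets = {
    "dataset_a": {"missing_ratio": 0.05, "row_count": 7_500},
    "dataset_b": {"missing_ratio": 0.20, "row_count": 10_000},
}

scores = {
    name: evaluate_goal(
        weights, {k: kpis[k].satisfaction(v) for k, v in observed.items()}
    )
    for name, observed in datasets.items()
}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

Changing the entries in `weights` mimics the customization step described above: raising the weight of `row_count` relative to `missing_ratio` shifts the ranking toward the larger dataset.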
Asma Yamani
King Fahd University of Petroleum and Minerals, Dhahran, KSA
N. AlAmoudi
King Fahd University of Petroleum and Minerals, Dhahran, KSA
Salma Albilali
King Fahd University of Petroleum and Minerals, Dhahran, KSA
Malak Baslyman
Assistant Professor, KFUPM
Jameleddine Hassine
Associate Professor of Computer Science, KFUPM
AI for Software Engineering, Dependability Analysis, Formal methods, Natural Language Processing