Data-aware Static Analysis: Improving Detection of Semantic Faults in Machine Learning Code Using Data Characteristics

📅 2026-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses semantic faults in machine learning code that arise when data properties do not match model assumptions, for example training a scale-sensitive model on unnormalized data. Developers typically find such faults only after training a model and manually inspecting the results, which is slow. To detect them earlier and automatically, the authors propose a data-aware static analysis that combines data-flow and control-flow analysis with API contracts, so the analyzer can reason about data properties while the code is still being written. An analysis of a sample of real-world machine learning notebooks indicates that the approach can detect faults that require data-aware reasoning.
📝 Abstract
Semantic faults specific to the use of machine learning models are a common problem for machine learning developers, causing suboptimal predictions, high computational cost, or incorrect outputs. For example, one may erroneously use unscaled data to train a scale-sensitive model. Machine learning developers detect these faults after training their models and manually analyzing the results, making it an inefficient process. We propose a novel data-aware static analysis approach to detect semantic faults in machine learning code, allowing developers to reveal these bugs while writing code instead of after training the model. Our approach uses combined data and control flow analysis, and API contracts, enabling data-aware reasoning about machine learning code at a high level of abstraction. We highlight the potential of our solution by analyzing a sample of real-world machine learning notebooks, finding that we can detect faults that require a data-aware approach.
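The abstract's running example, fitting a scale-sensitive model on unscaled data, can be illustrated with a toy checker. The sketch below is not the authors' tool and makes no claim about how it is implemented. It assumes a much simpler setting: straight-line, top-level Python statements, a single abstract data property ("scaled" or not), and two hand-written contract tables. The names `SCALE_SENSITIVE`, `SCALERS`, and `find_unscaled_fits` are hypothetical, and the class names in the tables are only examples of what such contracts might list.

```python
import ast

# Hypothetical API contracts (illustrative, not from the paper):
# estimators assumed to expect scaled input, and transformers assumed to produce it.
SCALE_SENSITIVE = {"SVC", "KNeighborsClassifier"}
SCALERS = {"StandardScaler", "MinMaxScaler"}


def _called_name(node):
    """Return the callee name of `Name(...)` or `mod.Name(...)`, else None."""
    if isinstance(node, ast.Call):
        if isinstance(node.func, ast.Name):
            return node.func.id
        if isinstance(node.func, ast.Attribute):
            return node.func.attr
    return None


def _is_method_call(node, kinds, kind, methods):
    """True if node is `var.method(...)` where var has the given kind."""
    return (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and isinstance(node.func.value, ast.Name)
        and kinds.get(node.func.value.id) == kind
        and node.func.attr in methods
    )


def find_unscaled_fits(source):
    """Return (line, variable) pairs where a scale-sensitive model is fit
    on a variable that is not known to hold scaled data.

    Only top-level statements are examined; branches, loops, and function
    bodies are ignored in this sketch.
    """
    kinds = {}      # variable -> "scaler" or "sensitive_model"
    scaled = set()  # variables known to hold scaled data
    warnings = []
    for stmt in ast.parse(source).body:
        if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            target, value = stmt.targets[0].id, stmt.value
            name = _called_name(value)
            # Evaluate the right-hand side before forgetting old facts.
            is_scaled = (
                _is_method_call(value, kinds, "scaler", {"fit_transform", "transform"})
                or (isinstance(value, ast.Name) and value.id in scaled)
            )
            # Reassignment replaces whatever was known about the target.
            kinds.pop(target, None)
            scaled.discard(target)
            if name in SCALERS:
                kinds[target] = "scaler"
            elif name in SCALE_SENSITIVE:
                kinds[target] = "sensitive_model"
            elif is_scaled:
                scaled.add(target)
        elif isinstance(stmt, ast.Expr):
            call = stmt.value
            if _is_method_call(call, kinds, "sensitive_model", {"fit"}) and call.args:
                first = call.args[0]
                # Only plain variable arguments are checked; other expressions are skipped.
                if isinstance(first, ast.Name) and first.id not in scaled:
                    warnings.append((stmt.lineno, first.id))
    return warnings


BUGGY = "model = SVC()\nmodel.fit(X_train, y_train)\n"
FIXED = (
    "scaler = StandardScaler()\n"
    "X_scaled = scaler.fit_transform(X_train)\n"
    "model = SVC()\n"
    "model.fit(X_scaled, y_train)\n"
)

if __name__ == "__main__":
    print(find_unscaled_fits(BUGGY))
    print(find_unscaled_fits(FIXED))
```

The checker parses the code without running it, so the fault is reported before any training happens. A realistic analysis would also need to follow control flow and draw its contracts from actual library behaviour, which this sketch does not attempt.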
Problem

Research questions and friction points this paper is trying to address.

semantic faults
machine learning code
data-aware analysis
static analysis
scale-sensitive models
Innovation

Methods, ideas, or system contributions that make the work stand out.

data-aware static analysis
semantic faults
machine learning code
data flow analysis
API contracts