🤖 AI Summary
Prior work lacks a realistic, crash-oriented benchmark dataset for Python machine learning code in Jupyter notebooks, particularly one that systematically covers debugging challenges such as out-of-order cell execution and crashes across popular ML libraries (e.g., TensorFlow/Keras, PyTorch). Method: The authors introduce JunoBench, the first standardized, reproducible benchmark of this kind, comprising 111 crash cases curated from real-world ML notebooks on Kaggle. Each case is paired with a human-validated fix and is reproduced in a containerized execution environment with combined static and dynamic verification. Contribution/Results: JunoBench fills an experimental gap in defect detection, localization, and repair research for interactive ML development, enabling rigorous, comparable evaluation of debugging and repair tools and improving empirical validity and reproducibility in this domain.
📝 Abstract
Jupyter notebooks are widely used for machine learning (ML) prototyping, yet few debugging tools are designed for ML code in notebooks, in part due to the lack of benchmarks. We introduce JunoBench, the first benchmark dataset of real-world crashes in Python-based ML notebooks. JunoBench contains 111 curated, reproducible crashes from public Kaggle notebooks, each paired with a verifiable fix, spanning popular ML libraries, including TensorFlow/Keras, PyTorch, Scikit-learn, Pandas, and NumPy, as well as notebook-specific out-of-order execution issues. To support reproducibility and ease of use, JunoBench offers a unified execution environment in which crashes and fixes can be reliably reproduced. By providing realistic crashes and their resolutions, JunoBench facilitates bug detection, localization, and repair tailored to the interactive and iterative nature of notebook-based ML development.
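The notebook-specific "out-of-order execution" failure mode mentioned above can be illustrated with a minimal sketch (not taken from the dataset): cells are modeled as code strings executed against a shared namespace, mimicking a Jupyter kernel, and re-running a late cell in a fresh kernel crashes because an earlier cell's definition is missing.

```python
# Minimal sketch of an out-of-order execution crash in a notebook.
# Cells are modeled as code strings run against a shared namespace,
# mimicking a Jupyter kernel; cell contents are illustrative only.

cells = {
    "cell_1": "data = [1, 2, 3]",
    "cell_2": "scaled = [x * 2 for x in data]",
    "cell_3": "total = sum(scaled)",
}

def run(order):
    """Execute cells in the given order in a fresh namespace."""
    ns = {}
    try:
        for name in order:
            exec(cells[name], ns)
        return ("ok", ns.get("total"))
    except NameError as e:
        return ("crash", str(e))

# Executing the cells top-to-bottom succeeds:
print(run(["cell_1", "cell_2", "cell_3"]))  # ('ok', 6)

# Re-running only cell_3 after a kernel restart crashes,
# because `scaled` no longer exists in the fresh namespace:
print(run(["cell_3"]))  # ('crash', "name 'scaled' is not defined")
```

Crashes like this are invisible to tools that treat a notebook as a single linear script, which is why a benchmark capturing real execution histories matters.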