Unboxing Default Argument Breaking Changes in 1 + 2 Data Science Libraries

📅 2024-08-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces and empirically investigates Default Argument Breaking Changes (DABCs), a class of backward-compatibility breaks in data science libraries that occur when the default value of an API argument changes between releases. The study identifies 93 real-world DABCs across scikit-learn, NumPy, and pandas via static API metadata analysis, cross-version comparison, dependency resolution over more than 500,000 GitHub projects, and impact propagation modeling. Results reveal highly uneven downstream impact: 35% of scikit-learn-dependent projects are affected, versus only 0.13% for NumPy. Most DABCs are introduced to improve API maintainability, yet they often alter function behavior, which exposes a tension between maintainability and behavioral stability. The study establishes DABCs as a prevalent source of compatibility risk and proposes evidence-based guidelines for library developers and users, including automated DABC detection, improved documentation practices, and structured migration support.

📝 Abstract
Data Science (DS) has become a cornerstone for modern software, enabling data-driven decisions to improve companies' services. Following modern software development practices, data scientists use third-party libraries to support their tasks. As the APIs provided by these tools often require an extensive list of arguments to be set up, data scientists rely on default values to simplify their usage. It turns out that these default values can change over time, leading to a specific type of breaking change, defined as Default Argument Breaking Change (DABC). This work reveals 93 DABCs in three Python libraries frequently used in Data Science tasks -- Scikit Learn, NumPy, and Pandas -- studying their potential impact on more than 500K client applications. We find that the occurrence of DABCs varies significantly depending on the library; 35% of Scikit Learn clients are affected, while only 0.13% of NumPy clients are impacted. The main reason for introducing DABCs is to enhance API maintainability, but they often change the function's behavior. We discuss the importance of managing DABCs in third-party DS libraries and provide insights for developers to mitigate the potential impact of these changes in their applications.
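To make the definition of a DABC concrete, the following is a minimal sketch in plain Python. The function `moving_average` and its `window` argument are hypothetical and do not come from the paper or from any of the three libraries; the two definitions stand in for two releases of the same library function whose only difference is the default value.

```python
# Hypothetical library function as shipped in two releases.
# The names and values are illustrative assumptions, not real library code.

def moving_average_v1(values, window=2):
    """Release 1: the default window is 2."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def moving_average_v2(values, window=3):
    """Release 2: identical body, but the default window is now 3."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


data = [1.0, 2.0, 3.0, 4.0]

# A client that relies on the default gets different results after upgrading,
# even though its own source code did not change.
implicit_v1 = moving_average_v1(data)
implicit_v2 = moving_average_v2(data)
print(implicit_v1)  # [1.5, 2.5, 3.5]
print(implicit_v2)  # [2.0, 3.0]

# A client that sets the argument explicitly is unaffected by the change.
explicit_v1 = moving_average_v1(data, window=2)
explicit_v2 = moving_average_v2(data, window=2)
print(explicit_v1 == explicit_v2)  # True
```

The call site is identical in both cases, which is why this kind of change can go unnoticed: nothing fails to import or raises an error, and only the computed values differ.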
Problem

Research questions and friction points this paper is trying to address.

Identifies Default Argument Breaking Changes (DABCs) in DS libraries
Analyzes impact of DABCs on 500K+ client applications
Compares DABC frequency across Scikit Learn, NumPy, Pandas
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies Default Argument Breaking Changes (DABCs)
Analyzes DABCs in Scikit Learn, NumPy, Pandas
Provides insights to mitigate the impact of DABCs
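One possible way to spot changed defaults is to compare function signatures across two versions. The sketch below uses the standard-library `inspect.signature` for this; it is not the paper's tooling, and the helper `changed_defaults` as well as the functions `fit_v1` and `fit_v2` are hypothetical names introduced only for illustration.

```python
import inspect


def changed_defaults(old_func, new_func):
    """Return {parameter: (old_default, new_default)} for defaults that differ.

    Only parameters that have a default in both versions are compared;
    added or removed parameters are a different kind of change and are skipped.
    """
    old_params = inspect.signature(old_func).parameters
    new_params = inspect.signature(new_func).parameters
    empty = inspect.Parameter.empty

    changes = {}
    for name, old_param in old_params.items():
        new_param = new_params.get(name)
        if new_param is None:
            continue  # parameter no longer exists in the new version
        if old_param.default is empty or new_param.default is empty:
            continue  # no default on one side, nothing to compare
        if old_param.default != new_param.default:
            changes[name] = (old_param.default, new_param.default)
    return changes


# Hypothetical "old" and "new" releases of the same API.
def fit_v1(data, n_iterations=10, solver="simple"):
    return None


def fit_v2(data, n_iterations=100, solver="simple"):
    return None


report = changed_defaults(fit_v1, fit_v2)
print(report)  # {'n_iterations': (10, 100)}
```

The comparison uses `!=`, so it assumes defaults are simple values such as numbers, strings, or `None`. On the client side, a complementary mitigation is to pass the arguments that matter explicitly, so results do not depend on whichever default the installed version ships.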
João Eduardo Montandon
Universidade Federal de Minas Gerais (DCC/UFMG)
Software Engineering, software maintenance, mining software repositories
Luciana Lourdes Silva
Instituto Federal de Minas Gerais, Ouro Branco, Brazil
Cristiano Politowski
Ontario Tech University, Oshawa, Canada
Daniel Prates
Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Arthur de Brito Bonifácio
Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Ghizlane El Boussaidi
École de Technologie Supérieure, Montréal, Canada