The Vocabulary of Flaky Tests in the Context of SAP HANA

📅 2023-10-26
🏛️ International Symposium on Empirical Software Engineering and Measurement
📈 Citations: 5
✨ Influential: 0
🤖 AI Summary
This study addresses the lack of effective methods for identifying and diagnosing flaky tests in industrial-scale software systems. It presents the first validation and enhancement of lexical-based flaky test detection within the large-scale industrial environment of SAP HANA. By integrating TF-IDF and TF-IDFC-RF feature extraction with CodeBERT and XGBoost classification models, the approach achieves F1 scores of 0.96 and 0.99 on the original benchmark and SAP HANA datasets, respectively. The research identifies external dependencies as the primary root cause of flakiness in SAP HANA and systematically evaluates the effectiveness of various features and models in real-world industrial settings. These findings provide a robust empirical foundation for the automated diagnosis of flaky tests, offering practical insights for improving test reliability in complex software systems.

๐Ÿ“ Abstract
Background. Automated test execution is an important activity for gathering information about the quality of a software project. So-called flaky tests, however, negatively affect this process: they fail seemingly at random without changes to the code and thus do not provide a clear signal. Previous work proposed identifying flaky tests based on the source code identifiers in the test code, but these approaches have not yet been evaluated in a large-scale industrial setting.

Aims. We evaluate approaches that identify flaky tests and their root causes based on source code identifiers in the test code in a large-scale industrial project.

Method. First, we replicate previous work by Pinto et al. in the context of SAP HANA. Second, we assess different feature extraction techniques, namely TF-IDF and TF-IDFC-RF. Third, we evaluate CodeBERT and XGBoost as classification models. For a sound comparison, we utilize both the data set from previous work and two data sets from SAP HANA.

Results. Our replication shows similar results on the original data set and on one of the SAP HANA data sets. While the original approach yielded an F1-Score of 0.94 on the original data set and 0.92 on the SAP HANA data set, our extensions achieve F1-Scores of 0.96 and 0.99, respectively. Reliance on external data sources is a common root cause of test flakiness in the context of SAP HANA.

Conclusions. The vocabulary of a large industrial project differs slightly in its exact terms, but the term categories, such as remote dependencies, are similar to previous empirical findings. However, even with rather large F1-Scores, both finding source code identifiers for flakiness and black-box prediction have limited use in practice, as the results are not actionable for developers.
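The pipeline the abstract describes — tokenizing test code identifiers, weighting them with TF-IDF, and classifying the resulting vectors — can be sketched in plain Python. Everything below is illustrative: the test snippets and labels are invented, and a nearest-neighbour cosine classifier stands in for the paper's XGBoost and CodeBERT models; it is not the SAP HANA data or the authors' implementation.

```python
import math
import re
from collections import Counter

# Hypothetical training data: identifier text from flaky and stable tests.
# Flaky examples lean on remote/timing vocabulary, mirroring the paper's
# finding that external dependencies are a common root cause.
TRAIN = [
    ("test_remote_fetch retry timeout http_client sleep", "flaky"),
    ("test_async_job wait_for network socket timeout", "flaky"),
    ("test_sum add numbers assert_equal", "stable"),
    ("test_parse parse_string assert_equal tokens", "stable"),
]

def tokenize(text):
    # Split snake_case identifiers into lowercase alphabetic terms.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def fit_idf(token_lists):
    # Inverse document frequency over the training vocabulary.
    n = len(token_lists)
    df = Counter(t for doc in token_lists for t in set(doc))
    return {t: math.log(n / df[t]) for t in df}

def vectorize(tokens, idf):
    # Sparse TF-IDF vector as a dict; terms unseen in training are dropped.
    tf = Counter(tokens)
    return {t: (tf[t] / len(tokens)) * idf[t] for t in tf if t in idf}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Train": precompute IDF weights and one TF-IDF vector per labelled test.
token_lists = [tokenize(text) for text, _ in TRAIN]
idf = fit_idf(token_lists)
vectors = [(vectorize(toks, idf), label)
           for toks, (_, label) in zip(token_lists, TRAIN)]

def classify(text):
    # Label a new test after its most similar labelled vector.
    query = vectorize(tokenize(text), idf)
    best_label, best_sim = "stable", -1.0
    for vec, label in vectors:
        sim = cosine(query, vec)
        if sim > best_sim:
            best_sim, best_label = sim, label
    return best_label

print(classify("test_download http_client retry timeout"))  # → flaky
```

Terms like "retry", "timeout", and "http_client" receive high IDF weight because they concentrate in the flaky examples, so a new test sharing that vocabulary lands closest to a flaky vector; this is the intuition behind vocabulary-based detection, independent of which classifier sits on top.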
Problem

Research questions and friction points this paper is trying to address.

flaky tests
automated testing
test flakiness
industrial software
source code identifiers
Innovation

Methods, ideas, or system contributions that make the work stand out.

flaky tests
source code identifiers
CodeBERT
TF-IDFC-RF
industrial evaluation