🤖 AI Summary
Predicting data transfer performance in scientific computing networks faces severe class imbalance, particularly due to the scarcity of slow-transfer samples. Method: This paper systematically evaluates generative data augmentation techniques—SMOTE, ADASYN, and CTGAN—in conjunction with logistic regression, XGBoost, and LSTM models, under multiple imbalance ratios. Contribution/Results: (1) Generative augmentation yields only marginal improvements for slow-transfer prediction, with diminishing returns as class imbalance intensifies; (2) CTGAN does not significantly outperform simple stratified sampling; (3) We propose a lightweight, efficient prediction paradigm grounded in stratified sampling as a baseline. This work challenges the prevailing assumption that generative models are necessary for this task, establishing a reproducible, low-overhead benchmark for early network performance prediction.
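The stratified-sampling baseline the summary advocates can be illustrated with a minimal, self-contained sketch (not the authors' code): subsample each class independently so the original imbalance ratio is preserved in the smaller training set.

```python
import random
from collections import Counter

def stratified_sample(X, y, frac, seed=0):
    """Draw a subsample that preserves each class's original proportion."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    Xs, ys = [], []
    for label, items in by_class.items():
        k = max(1, round(frac * len(items)))  # per-class sample size
        for xi in rng.sample(items, k):
            Xs.append(xi)
            ys.append(label)
    return Xs, ys

# Synthetic 10:1 imbalanced dataset: 1000 "fast" vs 100 "slow" transfers
X = list(range(1100))
y = ["fast"] * 1000 + ["slow"] * 100
Xs, ys = stratified_sample(X, y, frac=0.2)
print(Counter(ys))  # → Counter({'fast': 200, 'slow': 20}); 10:1 ratio kept
```

Because the per-class proportions are fixed, the slow-transfer class is never accidentally dropped from a small training split, which is the low-overhead property the paper contrasts against generative augmentation.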
📝 Abstract
Monitoring data transfer performance is a crucial task in scientific computing networks. By predicting performance early in the communication phase, potentially sluggish transfers can be identified and selectively monitored, optimizing network usage and overall performance. A key bottleneck to improving the predictive power of machine learning (ML) models in this context is class imbalance. This project addresses the class imbalance problem to enhance the accuracy of performance predictions. We analyze and compare various augmentation strategies, including traditional oversampling methods and generative techniques, and we vary the class imbalance ratios in the training datasets to evaluate their impact on model performance. While augmentation can yield modest gains, these gains diminish as the imbalance ratio increases. We conclude that even an advanced generative technique such as CTGAN does not significantly outperform simple stratified sampling.
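To make the oversampling side of the comparison concrete, here is a minimal sketch of the SMOTE idea (synthesizing minority samples by interpolating between a minority point and one of its nearest minority-class neighbors). The feature vectors and function name are illustrative, not taken from the paper.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolating each
    chosen point toward one of its k nearest minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x (excluding x itself)
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

# Toy minority class: a handful of "slow transfer" feature vectors
slow = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.3), (1.1, 2.1)]
new_points = smote_like(slow, n_new=6)
print(len(new_points))  # 6 synthetic slow-transfer samples
```

Each synthetic point lies on a segment between two real minority samples, so it stays inside the minority class's local region; the paper's finding is that adding such points (or CTGAN-generated ones) helps only marginally once the imbalance ratio grows large.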