Navigating Data Corruption in Machine Learning: Balancing Quality, Quantity, and Imputation Strategies

📅 2024-12-24
🤖 AI Summary
This study investigates the joint impact of data missingness and noise on machine learning performance. We systematically quantify trade-offs among data quality, volume, and imputation strategies across two representative scenarios: supervised NLP learning (BERT) and traffic signal control via reinforcement learning (PPO). Methodologically, we propose a novel "Performance Degradation Index Model under Data Corruption" and find that only about 30% of the data is critical in determining overall model performance. We introduce the concepts of an "imputation advantageous corner" and an "imputation disadvantageous edge", and, for the first time, categorize learning tasks into noise-sensitive and noise-insensitive classes. Results show that noise degrades performance more severely than missingness; imputation efficacy critically depends on the alignment between imputation accuracy and the data corruption rate; and merely scaling data volume mitigates, but does not eliminate, corruption effects, with diminishing marginal returns intensifying as corruption worsens.
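As a rough illustration of the imputation trade-off described above, one can sketch when imputation pays off. The cost model, weights, and thresholds below are assumptions for exposition, not the paper's fitted model; they only encode the qualitative findings that noise is costlier than missingness and that low-accuracy imputation can backfire.

```python
# Toy decision rule (an assumption, not the paper's model): with corruption
# rate r and imputation accuracy a, imputing converts the missing fraction r
# into a*r clean values and (1-a)*r noisy values. The per-unit costs c_miss
# and c_noise are illustrative, with c_noise > c_miss because noise is found
# to be more harmful than missingness.

def imputation_advantage(r, a, c_miss=1.0, c_noise=2.5):
    """Return the cost saved by imputing; positive means imputation helps."""
    cost_missing = c_miss * r               # leave the gaps as-is
    cost_imputed = c_noise * (1.0 - a) * r  # imputation errors act as noise
    return cost_missing - cost_imputed

# High-accuracy imputation: the "imputation advantageous corner".
print(imputation_advantage(r=0.5, a=0.9) > 0)  # True
# Low-accuracy imputation: the "imputation disadvantageous edge".
print(imputation_advantage(r=0.5, a=0.3) > 0)  # False
```

Under this toy model, the break-even accuracy is `a = 1 - c_miss / c_noise`, so a more noise-sensitive task (larger `c_noise`) demands a more accurate imputer before imputation becomes worthwhile.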

📝 Abstract
Data corruption, including missing and noisy data, poses significant challenges in real-world machine learning. This study investigates the effects of data corruption on model performance and explores strategies to mitigate these effects through two experimental setups: supervised learning with NLP tasks (NLP-SL) and deep reinforcement learning for traffic signal optimization (Signal-RL). We analyze the relationship between data corruption levels and model performance, evaluate the effectiveness of data imputation methods, and assess the utility of enlarging datasets to address data corruption. Our results show that model performance under data corruption follows a diminishing-return curve modeled by an exponential function. Missing data, while detrimental, is less harmful than noisy data, which causes severe performance degradation and training instability, particularly in sequential decision-making tasks like Signal-RL. Imputation strategies involve a trade-off: they recover missing information but may introduce noise. Their effectiveness depends on imputation accuracy and the corruption ratio. We identify distinct regions in the imputation advantage heatmap, including an "imputation advantageous corner" and an "imputation disadvantageous edge", and classify tasks as "noise-sensitive" or "noise-insensitive" based on their decision boundaries. Furthermore, we find that increasing dataset size mitigates but cannot fully overcome the effects of data corruption. The marginal utility of additional data diminishes as corruption increases. An empirical rule emerges: approximately 30% of the data is critical for determining performance, while the remaining 70% has minimal impact. These findings provide actionable insights into data preprocessing, imputation strategies, and data collection practices, guiding the development of robust machine learning systems in noisy environments.
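To make the diminishing-return claim concrete, here is a minimal sketch of an exponential performance curve. The saturating-exponential form follows the abstract's description, but `p_max` and the rate constant `k` are illustrative assumptions, not the paper's fitted values:

```python
import numpy as np

def performance(clean_fraction, p_max=1.0, k=4.0):
    """Hypothetical diminishing-return curve: performance rises toward
    p_max as the fraction of clean data grows. p_max and k are
    illustrative parameters, not values fitted in the paper."""
    return p_max * (1.0 - np.exp(-k * clean_fraction))

x = np.linspace(0.0, 1.0, 11)   # clean-data fraction from 0% to 100%
p = performance(x)
# Marginal gain per extra 10% of clean data shrinks monotonically,
# echoing the rule that the first ~30% of data matters most.
gains = np.diff(p)
print(p.round(3))
print(gains.round(3))
```

With these illustrative parameters, 30% clean data already yields about 70% of the maximum performance, while the remaining 70% of the data contributes the last 30%, which is qualitatively consistent with the abstract's 30/70 empirical rule.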
Problem

Research questions and friction points this paper is trying to address.

Machine Learning
Data Quality
Data Imputation Strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data Quality and Quantity
Noise Robustness in Machine Learning
Traffic Signal Optimization with Deep Learning
Qi Liu
College of Transportation Engineering, Tongji University, Shanghai, P.R. China
Wanjing Ma
Tongji University
Traffic Control · Connected Vehicles · Intelligent Transportation Systems