MLOps Monitoring at Scale for Digital Platforms

📅 2025-04-23
🤖 AI Summary
Massive, dynamic data streams in digital platforms render conventional ML monitoring methods ineffective or prohibitively costly in manual effort, forcing enterprises to downgrade to simpler models. Method: This paper proposes the Machine Learning Monitoring Agent (MLMA) framework, introducing a test-driven, automated retraining mechanism based on data-adaptive reference loss batches—designed to enable efficient closed-loop operations while preserving human-in-the-loop collaborative governance. The approach integrates design science principles, dynamic reference loss computation, key metric visualization, and human–AI collaborative workflows. Contribution/Results: Evaluated on a large-scale instant-delivery platform, MLMA supports concurrent monitoring of hundreds of models, significantly reduces manual intervention frequency, and sustains long-term online model performance stability. Its core contribution lies in unifying dynamic data adaptation, automated trigger logic, and human–AI collaboration—thereby overcoming critical technical bottlenecks in real-time monitoring and adaptive maintenance of large-scale ML systems.

📝 Abstract
Machine learning models are widely recognized for their strong forecasting performance. To keep that performance in streaming data settings, they have to be monitored and frequently re-trained. This can be done with machine learning operations (MLOps) techniques under the supervision of an MLOps engineer. However, in digital platform settings where the number of data streams is typically large and unstable, standard monitoring becomes either suboptimal or too labor-intensive for the MLOps engineer. As a consequence, companies often fall back on very simple, worse-performing ML models without monitoring. We solve this problem by adopting a design science approach and introducing a new monitoring framework, the Machine Learning Monitoring Agent (MLMA), that is designed to work at scale for any ML model at reasonable labor cost. A key feature of our framework is test-based automated re-training based on a data-adaptive reference loss batch. The MLOps engineer is kept in the loop via key metrics and also acts, proactively or retrospectively, to maintain performance of the ML model in the production stage. We conduct a large-scale test at a last-mile delivery platform to empirically validate our monitoring framework.
Problem

Research questions and friction points this paper is trying to address.

Monitoring ML models in large unstable data streams
Reducing labor-intensive MLOps supervision in digital platforms
Automating re-training to maintain model performance at scale
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLMA framework for scalable ML monitoring
Test-based automated re-training with adaptive loss
Pro-active MLOps engineer engagement via metrics
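The test-based re-training trigger with a data-adaptive reference loss batch can be illustrated with a minimal sketch. The paper does not specify the statistical test or the reference-batch update policy, so everything below is an assumption for illustration: a one-sided z-test on batch mean loss and a sliding-window reference update. Function names (`should_retrain`, `update_reference`) and all numeric values are hypothetical.

```python
import statistics

def should_retrain(reference_losses, recent_losses, z=3.0):
    """Hypothetical MLMA-style trigger: flag re-training when the mean
    loss of the most recent batch exceeds the reference-batch mean by
    more than `z` standard errors (simple one-sided z-test; the paper's
    actual test is not specified here)."""
    ref_mean = statistics.fmean(reference_losses)
    ref_se = statistics.stdev(reference_losses) / len(reference_losses) ** 0.5
    return statistics.fmean(recent_losses) > ref_mean + z * ref_se

def update_reference(reference_losses, recent_losses, max_size=500):
    """Data-adaptive reference batch, assumed here to be a sliding
    window: append recent losses and keep only the newest `max_size`
    observations, so the reference tracks the current data stream."""
    return (list(reference_losses) + list(recent_losses))[-max_size:]

# Illustrative losses: a stable stream vs. a drifted batch.
stable = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11, 0.10, 0.12]
drifted = [0.30, 0.28, 0.31, 0.29]
```

Running the trigger on these values, `should_retrain(stable, drifted)` fires while `should_retrain(stable, stable)` does not; each model on the platform would carry its own reference batch, which keeps the check cheap enough to run concurrently over hundreds of models while the MLOps engineer only reviews the triggered cases.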