🤖 AI Summary
To address model performance degradation caused by data and concept drift in MLOps, and the high computational overhead and strong model dependency of existing maintenance approaches, this paper proposes a lightweight model reuse mechanism based on distributional similarity. Methodologically, it introduces, for the first time, recognition of periodic patterns in temporal data distributions, which allows historical models to be retrieved and reused when a similar distribution recurs at a later time. It establishes an end-to-end MLOps maintenance pipeline comprising distribution similarity measurement, pattern mining, model version indexing, and retrieval, implemented in the open-source SimReuse toolkit (Python + MLflow). Evaluated on four real-world time-series datasets, the approach achieves prediction accuracy comparable to the best baseline while reducing computation time and cost by about 93% (a 15× efficiency gain), demonstrating a cost-effective, low-coupling, and scalable approach to continuous model maintenance.
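The 93% and 15× figures describe the same ratio. Assuming "15×" means the reuse approach needs one fifteenth of the baseline's computation time (a reading of the summary, not a definition given in the source), the relative reduction works out as:

```latex
\[
\text{reduction} \;=\; 1 - \frac{t_{\text{reuse}}}{t_{\text{baseline}}}
\;=\; 1 - \frac{1}{15} \;\approx\; 0.933 \;\approx\; 93\%
\]
```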
📝 Abstract
In recent years, many industries have adopted machine learning (ML) models in their systems. Ideally, ML models should be trained on and applied to data from the same distribution. However, in many application areas the data evolves over time, leading to data and concept drift, which in turn degrades the performance of ML models. Keeping ML models up to date therefore plays a critical role in the MLOps pipeline. Existing ML model maintenance approaches are often computationally intensive, costly, time-consuming, and model-dependent. To address these challenges, we propose an improved MLOps pipeline, a new model maintenance approach, and a Similarity Based Model Reuse (SimReuse) tool. In a preliminary study, we identify seasonal and recurrent distribution patterns in time series datasets. Recurrent distribution patterns enable us to reuse previously trained models when similar distributions appear in the future, thus avoiding frequent retraining. We then integrate the model reuse approach into the MLOps pipeline to obtain our improved pipeline. Furthermore, we develop SimReuse, a tool that implements the new components of our MLOps pipeline: it stores models and reuses them for inference on future data segments with similar data distributions. Our evaluation on four time series datasets demonstrates that our model reuse approach maintains model performance while significantly reducing maintenance time and costs. It achieves ML performance comparable to the best baseline while being 15 times more efficient in terms of computation time and costs. Industries and practitioners can therefore use our approach and tool to maintain the performance of their deployed ML models while reducing maintenance costs.
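The reuse idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure SimReuse uses, so the two-sample Kolmogorov-Smirnov statistic below is a stand-in chosen for illustration. `ModelStore`, `find_reusable`, and the `threshold` value are hypothetical names and settings, not the SimReuse API. The sketch stores each trained model alongside a sample of its training data, and for a new data segment returns the stored model whose training distribution is closest, provided the distance falls within the threshold.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Any, List, Optional, Sequence, Tuple


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of `a` and `b` (0 = identical, 1 = disjoint)."""
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for x in sa + sb:
        cdf_a = bisect_right(sa, x) / len(sa)
        cdf_b = bisect_right(sb, x) / len(sb)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


@dataclass
class ModelStore:
    """Hypothetical in-memory store pairing each model with a sample
    of the data it was trained on."""
    threshold: float = 0.2  # assumed cut-off; would need tuning per dataset
    entries: List[Tuple[List[float], Any]] = field(default_factory=list)

    def add(self, train_sample: Sequence[float], model: Any) -> None:
        self.entries.append((list(train_sample), model))

    def find_reusable(self, segment: Sequence[float]) -> Optional[Any]:
        """Return the stored model with the most similar training
        distribution, or None if none is within the threshold."""
        best_model, best_dist = None, self.threshold
        for sample, model in self.entries:
            dist = ks_statistic(sample, segment)
            if dist <= best_dist:
                best_model, best_dist = model, dist
        return best_model


# Toy data: two "seasons" with clearly different value ranges.
store = ModelStore(threshold=0.3)
winter = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8]
summer = [5.0, 5.2, 4.9, 5.1, 5.3, 4.8]
store.add(winter, "winter_model")  # strings stand in for trained models
store.add(summer, "summer_model")

next_winter = [1.05, 1.15, 0.95, 1.25, 0.85, 1.0]
print(store.find_reusable(next_winter))         # winter_model
print(store.find_reusable([10.0, 11.0, 12.0]))  # None
```

When `find_reusable` returns `None`, the segment's distribution has not been seen before, and a pipeline built on this idea would fall back to training a new model and adding it to the store. A real deployment would also need a persistent model registry and a similarity measure suited to multivariate data, neither of which this sketch attempts.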