Monitoring Machine Learning Systems: A Multivocal Literature Review

📅 2025-09-16
🤖 AI Summary
Runtime issues in dynamic production environments, such as data drift and system changes, frequently degrade ML model performance, undermining system reliability and user trust. To address this, we conduct a multivocal literature review guided by Garousi et al.'s methodology, systematically synthesizing 136 academic papers and industrial grey literature, the first such comprehensive integration to bridge the research-practice gap in ML monitoring. Our analysis clarifies core motivations, key metrics, technical tools, and application contexts; identifies three recurrent challenges and critical research gaps; proposes an evolution path for reliability-oriented monitoring frameworks; and delivers actionable, practice-ready recommendations. This work provides both theoretical foundations for advancing ML monitoring research and practical guidance for building robust MLOps systems in industry.

📝 Abstract
Context: Dynamic production environments make it challenging to maintain reliable machine learning (ML) systems. Runtime issues, such as changes in data patterns or operating contexts, that degrade model performance are a common occurrence in production settings. Monitoring enables early detection and mitigation of these runtime issues, helping maintain users' trust and prevent unwanted consequences for organizations. Aim: This study aims to provide a comprehensive overview of the ML monitoring literature. Method: We conducted a multivocal literature review (MLR) following the well-established guidelines by Garousi et al. to investigate various aspects of ML monitoring approaches in 136 papers. Results: We analyzed selected studies based on four key areas: (1) the motivations, goals, and context; (2) the monitored aspects, specific techniques, metrics, and tools; (3) the contributions and benefits; and (4) the current limitations. We also discuss several insights found in the studies, their implications, and recommendations for future research and practice. Conclusion: Our MLR identifies and summarizes ML monitoring practices and gaps, emphasizing similarities and disconnects between formal and grey literature. Our study is valuable for both academics and practitioners, as it helps select appropriate solutions, highlights limitations in current approaches, and provides future directions for research and tool development.
Problem

Research questions and friction points this paper is trying to address.

Monitoring ML systems in dynamic production environments
Detecting runtime issues degrading model performance
Providing a comprehensive overview of the ML monitoring literature
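One runtime issue named above, drift in the input data distribution, is commonly flagged with a two-sample distribution test. The sketch below is a minimal, pure-Python illustration (not from the paper) computing a two-sample Kolmogorov–Smirnov statistic; the 0.1 decision threshold is an assumed, illustrative value, not a recommendation from the reviewed literature.

```python
import random

def ks_statistic(a, b):
    """Max absolute difference between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        # Advance the pointer with the smaller value, then compare CDFs.
        if a[i] <= b[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def drift_detected(reference, live, threshold=0.1):
    """Flag drift when the live window deviates from the reference window.
    The threshold is illustrative; in practice it is tuned per feature."""
    return ks_statistic(reference, live) > threshold

random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(2000)]  # training-time distribution
stable = [random.gauss(0.0, 1.0) for _ in range(2000)]     # production, no drift
shifted = [random.gauss(0.8, 1.0) for _ in range(2000)]    # production, mean shift

print(drift_detected(reference, stable))   # False: same distribution
print(drift_detected(reference, shifted))  # True: mean has shifted
```

In a monitoring pipeline this check would run per feature over sliding windows of production traffic, with alerts routed to the serving team when the statistic exceeds the tuned threshold.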
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multivocal literature review method applied
Analyzed 136 papers on ML monitoring
Identified monitoring practices and gaps
Hira Naveed
Monash University, Australia
Scott Barnett
Deakin University, Australia
Chetan Arora
Monash University, Australia
John Grundy
Monash University, Australia
Hourieh Khalajzadeh
Senior Lecturer, Deakin University, Australia
Omar Haggag
Monash University, Australia