🤖 AI Summary
This paper addresses the latency–resource trade-off in performance anomaly detection for cloud services. We propose a modeling and quantitative analysis framework based on Stochastic Reward Nets (SRNs) to formally characterize how detection precision, recall, and inspection frequency jointly influence service latency and computational overhead. Our analysis reveals a dynamic trade-off: precision dominates the performance–cost balance under high-frequency inspection, whereas recall becomes the critical bottleneck under low-frequency inspection. Experimental evaluation demonstrates that the model accurately quantifies the aggregate cost of diverse detection strategies and provides a theoretical foundation for adaptive parameter tuning. The primary contribution is the first systematic formal modeling and empirical validation of the nonlinear trade-offs among anomaly detection parameters in cloud environments, enabling cost-effective, intelligent monitoring strategy design.
📝 Abstract
Detecting and resolving performance anomalies in Cloud services is crucial for maintaining desired performance objectives. Scaling actions triggered by an anomaly detector help achieve target latency at the cost of extra resource consumption. However, performance anomaly detectors make mistakes. This paper studies which characteristics of performance anomaly detection are important to optimize the trade-off between performance and cost. Using Stochastic Reward Nets, we model a Cloud service monitored by a performance anomaly detector. Using our model, we study the impact of detector characteristics, namely precision, recall, and inspection frequency, on the average latency and resource consumption of the monitored service. Our results show that achieving both high precision and high recall is not always necessary. If detection can be run frequently, high precision is enough to obtain a good performance-to-cost trade-off, but if the detector is run infrequently, recall becomes the most important characteristic.
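To build intuition for the trade-off the abstract describes, here is a minimal, hypothetical Monte Carlo sketch. It is not the paper's SRN model: the anomaly onset probability, the `false_alarm_rate` stand-in for precision, and the cost accounting are illustrative assumptions only.

```python
import random

def simulate(recall, false_alarm_rate, period, steps=50_000, seed=0):
    """Toy simulation of a service watched by an imperfect anomaly detector.

    recall           -- P(detector flags a real, ongoing anomaly at an inspection)
    false_alarm_rate -- P(detector flags a healthy service at an inspection);
                        a lower rate corresponds to higher precision
    period           -- number of time steps between inspections
    Returns (latency penalty per step, scaling actions per step).
    """
    rng = random.Random(seed)
    anomalous = False
    latency_penalty = 0   # extra latency accumulated while an anomaly goes undetected
    scaling_actions = 0   # each scale-up (true or false positive) costs resources
    for t in range(steps):
        if not anomalous and rng.random() < 0.01:    # assumed anomaly onset rate
            anomalous = True
        if anomalous:
            latency_penalty += 1
        if t % period == 0:                          # inspection instant
            if anomalous and rng.random() < recall:  # true positive -> repair
                anomalous = False
                scaling_actions += 1
            elif not anomalous and rng.random() < false_alarm_rate:
                scaling_actions += 1                 # false positive -> wasted scale-up
    return latency_penalty / steps, scaling_actions / steps

frequent = simulate(recall=0.6, false_alarm_rate=0.05, period=5)
infrequent = simulate(recall=0.6, false_alarm_rate=0.05, period=50)
```

With identical detector quality, frequent inspection keeps the latency penalty low but pays for more (including spurious) scaling actions, while infrequent inspection makes each missed detection far more costly in latency, which is why recall dominates in that regime.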