🤖 AI Summary
Online data-intensive applications (such as message queues, databases, and ML inference services) frequently suffer from QoS degradation due to dynamic workloads and kernel-level resource contention. Existing diagnostic approaches, relying on coarse-grained system metrics, fail to pinpoint fine-grained root causes (e.g., lock contention, CPU saturation, or disk I/O bottlenecks). This paper proposes an application-agnostic, general-purpose diagnosis framework. It systematically defines and implements 16 fine-grained eBPF metrics spanning six major kernel subsystems, capturing kernel events in real time via kprobes, uprobes, and tracepoints, with support for multi-dimensional statistical aggregation. Evaluated under controlled, high-stress workload scenarios, the method accurately decomposes performance degradation into constituent factors, distinguishes workload-induced changes from underlying resource bottlenecks, and uncovers architectural constraints inherent to applications. Results demonstrate significantly improved accuracy and generality in QoS problem attribution.
📝 Abstract
Online data-intensive applications (e.g., message brokers, ML inference services, and databases) are core components of the modern internet, providing critical functionality to the services that connect to them. The load variability and interference they experience are generally the main causes of Quality of Service (QoS) degradation, which harms dependent applications and results in an impaired end-user experience. Uncovering the cause of QoS degradation requires detailed instrumentation of an application's activity. Existing generalisable approaches utilise readily available system metrics that encode interference in kernel metrics, but these approaches lack the detail required to pinpoint granular causes of performance degradation (e.g., lock, disk, and CPU contention). In contrast, this paper explores the use of fine-grained system-level metrics to facilitate an application-agnostic diagnosis of QoS degradation. To this end, we introduce and implement $16$ $\textit{eBPF-based metrics}$ spanning six kernel subsystems, which capture statistics over kernel events that often highlight obstacles impeding an application's progress. We demonstrate the use of our $\textit{eBPF-based metrics}$ through extensive experiments on a representative set of online data-intensive applications. Results show that the implemented metrics can deconstruct performance degradation when applications face variable workload patterns and common resource contention scenarios, while also revealing applications' internal architectural constraints.
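To make the idea of "statistics over kernel events" concrete, the following is a minimal user-space sketch in plain Python, not the paper's implementation. It assumes that some eBPF probe has already emitted event records of the form `(subsystem, pid, duration_ns)`; that record layout, the subsystem names, the `EventStats` class, and the `aggregate` helper are all hypothetical and chosen only for illustration. The sketch keeps a running count, total, maximum, and a power-of-two latency histogram per `(subsystem, pid)` key, which is one plausible shape for multi-dimensional aggregation.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EventStats:
    """Running statistics for one (subsystem, pid) key (hypothetical layout)."""
    count: int = 0
    total_ns: int = 0
    max_ns: int = 0
    # Power-of-two histogram: bucket b counts durations whose bit length is b,
    # i.e. durations d with 2**(b-1) <= d < 2**b.
    log2_hist: dict = field(default_factory=lambda: defaultdict(int))

    def add(self, duration_ns: int) -> None:
        self.count += 1
        self.total_ns += duration_ns
        self.max_ns = max(self.max_ns, duration_ns)
        self.log2_hist[duration_ns.bit_length()] += 1

    @property
    def mean_ns(self) -> float:
        return self.total_ns / self.count if self.count else 0.0


def aggregate(events):
    """Group (subsystem, pid, duration_ns) records and summarise each group."""
    stats = defaultdict(EventStats)
    for subsystem, pid, duration_ns in events:
        stats[(subsystem, pid)].add(duration_ns)
    return stats


# Hypothetical samples; real records would come from kernel probes.
events = [
    ("futex", 101, 1_000),
    ("futex", 101, 3_000),
    ("block_io", 101, 250_000),
    ("futex", 202, 500),
]
stats = aggregate(events)
```

Keying by both subsystem and process lets a reader compare, for instance, lock wait time against block I/O time for the same process, which is the kind of decomposition the abstract describes.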