🤖 AI Summary
This study addresses alert fatigue in large-scale cloud systems, where excessive alert volume degrades operational efficiency. The authors propose a three-stage framework that combines large language models with lightweight graph learning to cover the full alert lifecycle: alert denoising, summary generation, and iterative rule optimization. The approach pairs graph-structured modeling (including virtual noise nodes) with retrieval-augmented generation, and adds a multi-agent feedback mechanism that refines alert rules iteratively. Evaluated on real-world industrial datasets, the method achieves a 94.8% alert reduction ratio and 90.5% fault diagnosis accuracy, and refines 1,174 alert rules, of which 375 were adopted by the Site Reliability Engineering (SRE) team.
📝 Abstract
Alerts are critical for detecting anomalies in large-scale cloud systems, ensuring reliability and user experience. However, current systems generate overwhelming volumes of alerts, and ineffective alert life-cycle management degrades operational efficiency. This paper details the efforts of Company-X to optimize alert life-cycle management and address alert fatigue in cloud systems. We propose AlertGuardian, a framework that combines large language models (LLMs) with lightweight graph models to optimize the alert life-cycle through three phases: Alert Denoise uses a graph learning model with virtual noise to filter noisy alerts, Alert Summary employs Retrieval Augmented Generation (RAG) with LLMs to create actionable summaries, and Alert Rule Refinement leverages iterative multi-agent feedback to improve alert rule quality. Evaluated on four real-world datasets from Company-X’s services, AlertGuardian significantly mitigates alert fatigue (94.8% alert reduction ratio) and accelerates fault diagnosis (90.5% diagnosis accuracy). Moreover, AlertGuardian improves 1,174 alert rules, of which 375 were accepted by SREs (32% acceptance rate). Finally, we share success stories and lessons learned about alert life-cycle management after the deployment of AlertGuardian in Company-X.
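The headline percentages can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the definition of the alert reduction ratio (share of raw alerts removed) is an assumption, since the abstract does not define it. Only the 375 and 1,174 figures come from the abstract; the alert counts are invented to show the formula.

```python
def ratio_percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, 1)


def reduction_ratio_percent(raw_alerts: int, remaining_alerts: int) -> float:
    """Assumed definition: percentage of raw alerts that were removed."""
    return ratio_percent(raw_alerts - remaining_alerts, raw_alerts)


# Figures from the abstract: 375 of 1,174 refined rules were accepted by SREs.
acceptance = ratio_percent(375, 1174)

# Hypothetical counts, chosen only to illustrate the formula.
example_reduction = reduction_ratio_percent(1000, 52)

print(f"acceptance rate: {acceptance}%")
print(f"example reduction ratio: {example_reduction}%")
```

The acceptance figure works out to roughly 31.9%, which matches the 32% reported in the abstract after rounding.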