🤖 AI Summary
To address challenges in cloud-native communication and networking services (ambiguous SLI/SLO definitions, the expertise barrier of specialized monitoring tools, and low cross-organizational trust in metrics), this paper proposes SRE-Llama, a novel SRE platform that integrates generative AI, federated learning, and blockchain. Methodologically, it uses federated learning for collaborative, privacy-preserving discovery of SLI metrics across distributed environments; employs a QLoRA-finetuned Llama-3-8B model for context-aware SLI/SLO generation; and uses smart contracts and NFTs to record the generated metrics on-chain so they can be immutably attested and audited. The platform works with Prometheus and Mimir, targets consumer-grade hardware, and was prototyped with a customized Open5GS 5G Core use case. It aims to combine data privacy, system transparency, and engineering practicality.
📝 Abstract
Software services are crucial for reliable communication and networking; therefore, Site Reliability Engineering (SRE) is important to ensure these systems stay reliable and perform well in cloud-native environments. SRE leverages tools like Prometheus and Grafana to monitor system metrics, defining critical Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for maintaining high service standards. However, a significant challenge arises because many developers lack an in-depth understanding of these tools and of the intricacies involved in defining appropriate SLIs and SLOs. To bridge this gap, we propose a novel SRE platform, called “SRE-Llama”, enhanced by Generative AI, Federated Learning, Blockchain, and Non-Fungible Tokens (NFTs). This platform aims to automate and simplify monitoring, SLI/SLO generation, and alert management, making these tasks more accessible and efficient for developers. The automation processes are governed by smart contracts on the Blockchain, ensuring transparency and security. The system operates by capturing metrics from cloud-native services and storing them in a time-series database such as Prometheus or Mimir. Using this stored data, our platform employs Federated Learning models to identify the most relevant and impactful SLI metrics for different services, along with SLO target values, addressing concerns around data privacy and decentralized data sources. Subsequently, a custom-trained Meta Llama-3 LLM intelligently generates SLIs, SLOs, error budgets, and associated alerting mechanisms based on these identified SLI metrics. The Llama-3-8B LLM has been quantized and finetuned using Quantized Low-Rank Adaptation (QLoRA) so that it runs efficiently on consumer-grade hardware. A unique aspect of our platform is the encoding of generated SLIs and SLOs as NFT objects, which are then stored on a Blockchain. This feature provides immutable record-keeping and facilitates easy verification and auditing of the SRE metrics and objectives. It enhances the traceability and accountability of the SRE processes, offering a verifiable and transparent record of the system’s performance standards. The proposed SRE-Llama platform prototype has been implemented with a use case featuring a customized Open5GS 5G Core.
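To make the relationship between an SLI, an SLO, and an error budget concrete, the following is a minimal Python sketch of the standard arithmetic. It is not taken from the SRE-Llama implementation: the `Slo` class, the function names, and the example service name and numbers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO record: a target ratio over a rolling window."""
    name: str
    target: float      # required fraction of good events, e.g. 0.999
    window_days: int   # length of the evaluation window in days


def sli(good_events: int, total_events: int) -> float:
    """SLI as the ratio of good events to all events (1.0 if no traffic)."""
    return 1.0 if total_events == 0 else good_events / total_events


def error_budget_minutes(slo: Slo) -> float:
    """Downtime allowed by the SLO over its window, in minutes."""
    window_minutes = slo.window_days * 24 * 60
    return window_minutes * (1.0 - slo.target)


def budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = total_events * (1.0 - slo.target)
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


# Illustrative values only: a 99.9% availability target over 30 days.
example = Slo(name="registration-availability", target=0.999, window_days=30)
print(round(error_budget_minutes(example), 1))                   # about 43.2 minutes
print(round(budget_remaining(example, 999_500, 1_000_000), 2))   # about half left
```

In a setup like the one the abstract describes, the good and total event counts would come from queries against the time-series database, and an alert could fire when the remaining budget falls below a chosen threshold.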