🤖 AI Summary
This work addresses the limitations of existing CXL memory tiering solutions, which lack multi-tenancy support, fairness guarantees, and fine-grained observability, all of which are critical for meeting service-level objectives (SLOs) in large-scale data centers. The authors present Equilibria, an operating-system-level framework for multi-tenant-aware CXL memory tiering in Linux. It provides per-container fair-share memory allocation, user-specified fairness policies enforced through regulated page promotion and demotion, fine-grained observability of tiered-memory usage and operations, and thrashing suppression to limit noisy-neighbor interference. Evaluated in a large hyperscaler fleet with production workloads and benchmarks, Equilibria helps workloads meet their SLOs while avoiding performance interference. Compared with Transparent Page Placement (TPP), the state-of-the-art Linux tiering solution, it improves performance by up to 52% on production workloads and by up to 1.7× on benchmarks. All Equilibria patches have been released to the Linux community.
📝 Abstract
Memory dominates datacenter system cost and power. Memory expansion via Compute Express Link (CXL) is an effective way to provide additional memory at lower cost and power, but its effective use requires software-level tiering for hyperscaler workloads. Existing tiering solutions, including current Linux support, face fundamental limitations in production deployments. First, they lack multi-tenancy support, failing to handle stacked homogeneous or heterogeneous workloads. Second, limited control-plane flexibility leads to fairness violations and performance variability. Finally, insufficient observability prevents operators from diagnosing performance pathologies at scale. We present Equilibria, an OS framework enabling fair, multi-tenant CXL tiering at datacenter scale. Equilibria provides per-container controls for memory fair-share allocation and fine-grained observability of tiered-memory usage and operations. It further enforces flexible, user-specified fairness policies through regulated promotion and demotion, and mitigates noisy-neighbor interference by suppressing thrashing. Evaluated in a large hyperscaler fleet using production workloads and benchmarks, Equilibria helps workloads meet service level objectives (SLOs) while avoiding performance interference. It improves performance over the state-of-the-art Linux solution, TPP, by up to 52% for production workloads and 1.7x for benchmarks. All Equilibria patches have been released to the Linux community.