Analyzing a Two-Tier Disaggregated Memory Protection Scheme Based on Memory Replication

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Combining error-correcting codes (ECC) with cross-domain memory replication, without accounting for how the two interact, duplicates protection and inflates storage overhead. Method: The paper analyzes a two-tier protection scheme in which ECC protects each replica and replication across independent failure domains (for example, NUMA sockets or racks) supplies a second layer of redundancy. It proposes Replication-Aware Memory-error Protection (RAMP), a model for exploring protection strategies that lower the storage cost of each replica while relying on the replicas collectively for robust protection. Contribution/Results: Applied to a state-of-the-art memory protection mechanism paired with rack-level replication for disaggregated memory, a RAMP-derived solution reduces the storage cost of memory protection from 27% to 17.7% with minimal performance overhead.
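To put the headline numbers in perspective, the short calculation below converts the reported drop from 27% to 17.7% into an absolute and a relative saving. The function name `overhead_savings` is hypothetical and introduced only for this illustration; the two percentages are the only values taken from the paper.

```python
def overhead_savings(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative reduction in percent).

    `before` and `after` are protection storage overheads expressed in percent.
    """
    absolute = before - after
    relative = absolute / before * 100.0
    return absolute, relative


# The two overhead figures reported in the abstract.
absolute, relative = overhead_savings(27.0, 17.7)
print(f"absolute: {absolute:.1f} points, relative: {relative:.1f}%")
# prints "absolute: 9.3 points, relative: 34.4%"
```

In other words, the reported result removes roughly a third of the storage that the baseline spends on protection.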

📝 Abstract
As memory technologies continue to shrink and memory error rates increase, the demand for stronger reliability becomes increasingly critical. Fine-grain memory replication has emerged as an appealing approach to improving memory fault tolerance by augmenting conventional memory protection based on error-correcting codes with an additional layer of redundancy that replicates data across independent failure domains, such as replicating memory pages across different NUMA sockets. This method can tolerate a broad spectrum of memory errors, from individual memory cell failures to more complex memory controller failures. However, applying memory replication without a holistic consideration of the interaction between error-correcting codes and replication can result in redundant duplication and unnecessary storage overhead. We propose Replication-Aware Memory-error Protection (RAMP), a model that helps explore error protection strategies to improve the storage efficiency of memory protection in memory systems that utilize memory replication for performance and availability. We use RAMP to determine a protection strategy that can lower the storage cost of individual replicas while still ensuring robust protection through the collective protection conferred by multiple replicas. Our evaluation shows that a solution derived with RAMP enhances the storage efficiency of a state-of-the-art memory protection mechanism when paired with rack-level replication for disaggregated memory. Specifically, we can reduce the storage cost of memory protection from 27% down to 17.7% with minimal performance overhead.
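The idea of cheaper per-replica protection backed by the collective protection of multiple replicas can be illustrated with a minimal sketch. It is not the mechanism from the paper: it assumes, purely for illustration, that each replica carries only a detection checksum (CRC32 from Python's `zlib`, standing in for whatever lighter code a real design would use) and that correction comes from another replica in a different failure domain. The names `Replica` and `read_page` are hypothetical.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Replica:
    """One copy of a page plus a detection-only checksum (hypothetical layout)."""
    data: bytearray
    checksum: int

    @classmethod
    def store(cls, payload: bytes) -> "Replica":
        return cls(bytearray(payload), zlib.crc32(payload))

    def is_intact(self) -> bool:
        # Detection only: the checksum can flag corruption but cannot repair it.
        return zlib.crc32(bytes(self.data)) == self.checksum


def read_page(replicas: list[Replica]) -> bytes:
    """Return the first copy whose checksum verifies and repair the others from it."""
    for good in replicas:
        if good.is_intact():
            for other in replicas:
                if not other.is_intact():
                    other.data[:] = good.data
                    other.checksum = good.checksum
            return bytes(good.data)
    raise RuntimeError("all replicas failed their checksum")


# Two replicas, assumed to live in independent failure domains.
page = b"disaggregated memory page"
replicas = [Replica.store(page), Replica.store(page)]

replicas[0].data[3] ^= 0x01      # simulate a single-bit error in the first replica
recovered = read_page(replicas)  # served from the second replica, first one repaired
```

The design point the sketch tries to convey is that a replica only needs enough local redundancy to notice an error, because recovery can be delegated to another copy. Whether detection-only protection is sufficient in practice depends on the error model, which is what a model such as RAMP is meant to explore.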
Problem

Research questions and friction points this paper is trying to address.

Improving memory fault tolerance as memory error rates rise
Avoiding redundant protection when ECC and replication are combined
Reducing the storage overhead of memory protection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replication-Aware Memory-error Protection (RAMP)
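Lower per-replica protection cost, backed by the collective protection of multiple replicas
Storage cost of memory protection reduced from 27% to 17.7% with minimal performance overhead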