Prefetching in Deep Memory Hierarchies with NVRAM as Main Memory

📅 2025-09-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the added latency that off-chip hybrid memory controllers introduce in NVRAM-based main memory systems, particularly for big-data and machine-learning workloads, this paper evaluates multi-level prefetching that spans the off-chip hybrid memory controller (HMC) and the on-chip L1 cache. Two configurations are compared on an out-of-order execution processor, both using prefetching mechanisms commonly deployed by processor vendors: HMC, which places a prefetcher inside the hybrid memory controller, and HMC+L1, which adds on-chip L1 prefetchers. HMC-only prefetching improves performance by 9%, and the combined HMC+L1 scheme by 12%. The off-chip HMC prefetcher reaches coverage and accuracy above 60% and up to 80%, and adding L1 prefetching raises its coverage to as much as 92%. The key finding is that on-chip L1 prefetchers are crucial for getting the full benefit of off-chip prefetching in deep, NVRAM-centric memory hierarchies.

📝 Abstract

Emerging applications, such as big data analytics and machine learning, require increasingly large amounts of main memory, often exceeding the capacity of current commodity processors built on DRAM technology. To address this, recent research has focused on off-chip memory controllers that facilitate access to diverse memory media, each with unique density and latency characteristics. While these solutions improve memory system performance, they also exacerbate the already significant memory latency. As a result, multi-level prefetching techniques are essential to mitigate these extended latencies. This paper investigates the advantages of prefetching across both sides of the memory system: the off-chip memory and the on-chip cache hierarchy. Our primary objective is to assess the impact of a multi-level prefetching engine on overall system performance. Additionally, we analyze the individual contribution of each prefetching level to system efficiency. To achieve this, the study evaluates two key prefetching approaches: HMC (Hybrid Memory Controller) and HMC+L1, both of which employ prefetching mechanisms commonly used by processor vendors. The HMC approach integrates a prefetcher within the off-chip hybrid memory controller, while the HMC+L1 approach combines this with additional L1 on-chip prefetchers. Experimental results on an out-of-order execution processor show that on-chip cache prefetchers are crucial for maximizing the benefits of off-chip prefetching, which in turn further enhances performance. Specifically, the off-chip HMC prefetcher achieves coverage and accuracy rates exceeding 60% and up to 80%, while the combined HMC+L1 approach boosts off-chip prefetcher coverage to as much as 92%. Consequently, the overall performance improvement rises from 9% with the HMC approach to 12% when L1 prefetching is also employed.
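
The coverage and accuracy figures quoted above can be made concrete with a small sketch. It assumes the conventional definitions (coverage as the fraction of would-be demand misses removed by prefetching, accuracy as the fraction of issued prefetches that turn out to be useful); the abstract does not spell out the paper's exact formulas, and the class name, field names, and counter values below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PrefetchStats:
    """Hypothetical counters a simulator might collect for one prefetcher."""
    issued: int         # prefetch requests sent to the next memory level
    useful: int         # prefetched lines later used by a demand access
    demand_misses: int  # demand misses the prefetcher failed to cover

    def coverage(self) -> float:
        # Misses that would have occurred without prefetching (assumed definition).
        baseline_misses = self.useful + self.demand_misses
        return self.useful / baseline_misses if baseline_misses else 0.0

    def accuracy(self) -> float:
        # Share of issued prefetches that were actually used (assumed definition).
        return self.useful / self.issued if self.issued else 0.0


# Illustrative counter values only; these are not results from the paper.
stats = PrefetchStats(issued=1000, useful=800, demand_misses=200)
print(f"coverage={stats.coverage():.0%} accuracy={stats.accuracy():.0%}")
```

Under these definitions a prefetcher can trade one metric for the other: issuing more speculative requests tends to raise coverage while lowering accuracy.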

Problem

Research questions and friction points this paper is trying to address.

Addressing increased memory latency in systems using NVRAM as main memory

Evaluating multi-level prefetching across off-chip memory and on-chip cache hierarchy

Assessing performance impact of combined HMC and L1 prefetching approaches

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-level prefetching across off-chip and on-chip memory

Combining HMC off-chip prefetcher with L1 on-chip prefetchers

Raising off-chip prefetcher coverage to as much as 92% through coordinated prefetching levels
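
As a rough illustration of the kind of mechanism a prefetching level might use, the sketch below implements a toy per-PC stride prefetcher. This is an assumption for illustration: the summary only says the evaluated prefetchers are ones commonly used by processor vendors, and it does not name the algorithm. The class `StridePrefetcher`, its `degree` parameter, and the example trace are all hypothetical.

```python
class StridePrefetcher:
    """Toy per-PC stride prefetcher (hypothetical; not the paper's design)."""

    def __init__(self, degree: int = 1):
        self.degree = degree  # number of lines to prefetch ahead
        self.table = {}       # pc -> (last address, last observed stride)

    def access(self, pc: int, addr: int) -> list[int]:
        """Record a demand access and return the addresses to prefetch."""
        prefetches = []
        entry = self.table.get(pc)
        if entry is None:
            # First access from this PC: nothing to compare against yet.
            self.table[pc] = (addr, 0)
            return prefetches
        last_addr, last_stride = entry
        stride = addr - last_addr
        # Prefetch only when the same non-zero stride is seen twice in a row.
        if stride != 0 and stride == last_stride:
            prefetches = [addr + stride * i for i in range(1, self.degree + 1)]
        self.table[pc] = (addr, stride)
        return prefetches


# Example: one load instruction walking an array in 64-byte steps.
pf = StridePrefetcher(degree=2)
trace = [0, 64, 128, 192]
issued = [pf.access(pc=0x400, addr=a) for a in trace]
```

In a two-level arrangement, one such object could sit at the L1 and another inside the off-chip controller, each observing the access stream that reaches its level. How the two levels actually interact in the paper is not described in this summary.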
