Piccolo: Large-Scale Graph Processing with Fine-Grained In-Memory Scatter-Gather

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-grained random memory accesses in graph processing cause severe off-chip bandwidth underutilization. Existing accelerators—based on graph partitioning or processing-in-memory (PIM)—suffer from memory access granularity mismatch, low cache and bandwidth utilization, poor synergy between partitioning and PIM, and high cost/limited functionality of compute-capable PIM hardware. Method: This paper proposes an end-to-end memory-access–centric acceleration architecture. It introduces, for the first time, a fine-grained in-memory scatter-gather mechanism that eliminates the need for off-die compute units. It jointly exploits graph partitioning–enabled data reuse and in-memory computation, co-optimizing the cache–memory hierarchy and coherence protocol, while incorporating DDR-aware low-overhead memory access compression and hierarchical scheduling. Results: Evaluated on diverse large-scale graph benchmarks, the design achieves up to 3.28× speedup and a geometric mean improvement of 1.62×, significantly reducing off-chip traffic and bandwidth waste.

📝 Abstract
Graph processing requires irregular, fine-grained random access patterns that are incompatible with contemporary off-chip memory architecture, leading to inefficient data access. This inefficiency makes graph processing an extremely memory-bound application. Because of this, existing graph processing accelerators typically employ a graph tiling-based or processing-in-memory (PIM) approach to relieve the memory bottleneck. In the tiling-based approach, a graph is split into chunks that fit within the on-chip cache to maximize data reuse. In the PIM approach, arithmetic units are placed within memory to perform operations such as reduction or atomic addition. However, both approaches have several limitations, especially when implemented on current memory standards (i.e., DDR). Because the access granularity provided by DDR is much larger than that of the graph vertex property data, much of the bandwidth and cache capacity are wasted. PIM is meant to alleviate such issues, but it is difficult to use in conjunction with the tiling-based approach, resulting in a significant disadvantage. Furthermore, placing arithmetic units inside a memory chip is expensive, making it impractical to support multiple types of operations. To address the above limitations, we present Piccolo, an end-to-end efficient graph processing accelerator with fine-grained in-memory random scatter-gather. Instead of placing expensive arithmetic units in off-chip memory, Piccolo focuses on reducing the off-chip traffic with a non-arithmetic function-in-memory of random scatter-gather. To fully benefit from in-memory scatter-gather, Piccolo redesigns the cache and MHA of the accelerator so that it enjoys both the advantage of tiling and that of in-memory operations. Piccolo achieves a maximum speedup of 3.28× and a geometric mean speedup of 1.62× across a broad set of benchmarks.
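The granularity mismatch described above can be quantified with a back-of-the-envelope model. The sketch below is illustrative only and not taken from the paper: it assumes a 64 B DDR burst and 4 B vertex properties, and compares the off-chip traffic of naive random reads against an idealized in-memory gather that packs only the requested properties into full bursts.

```python
# Hedged illustration (assumed parameters, not the paper's evaluation):
# why DDR access granularity wastes bandwidth on fine-grained vertex-property
# reads, and how an in-memory scatter-gather can pack useful data per burst.

BURST_BYTES = 64   # assumed DDR access granularity per request
PROP_BYTES = 4     # assumed 32-bit vertex property (rank, distance, ...)

def baseline_traffic(num_random_reads: int) -> int:
    # Each random 4 B read still transfers a full 64 B burst off-chip,
    # so only 1/16 of the moved bytes are useful.
    return num_random_reads * BURST_BYTES

def scatter_gather_traffic(num_random_reads: int) -> int:
    # An idealized in-memory gather packs 16 scattered 4 B properties
    # into each 64 B burst before crossing the off-chip interface.
    props_per_burst = BURST_BYTES // PROP_BYTES
    bursts = -(-num_random_reads // props_per_burst)  # ceiling division
    return bursts * BURST_BYTES

reads = 1_000_000
print(baseline_traffic(reads))        # 64000000 bytes moved
print(scatter_gather_traffic(reads))  # 4000000 bytes moved (16x less)
```

Under these assumptions the gather cuts off-chip traffic by the burst-to-property ratio (16×); the real design's gains are bounded by locality, partitioning, and the cache/MHA co-design the abstract describes.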
Problem

Research questions and friction points this paper is trying to address.

Inefficient data access in graph processing due to off-chip memory architecture.
Limitations of graph tiling and PIM approaches in current memory standards.
High cost and impracticality of placing arithmetic units in memory chips.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained in-memory scatter-gather technique
Redesigns cache and MHA for tiling benefits
Reduces off-chip traffic with non-arithmetic functions
Changmin Shin
Department of Electrical and Computer Engineering, Seoul National University
Jaeyong Song
University Distinguished Professor, Seoul National University
Strategy, international management
Hongsun Jang
Department of Electrical and Computer Engineering, Seoul National University
Dogeun Kim
Department of Electrical and Computer Engineering, Seoul National University
Jun Sung
Department of Electrical and Computer Engineering, Seoul National University
Taehee Kwon
Department of Electrical and Computer Engineering, Seoul National University
Jae Hyung Ju
Georgia Institute of Technology
computer architecture
Frank Liu
School of Data Science, Old Dominion University
Yeonkyu Choi
Samsung Electronics
Jinho Lee
Department of Electrical and Computer Engineering, Seoul National University
Computer architecture, Computer systems, Machine learning