PIUMA: Programmable Integrated Unified Memory Architecture

📅 2020-10-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Conventional processors show low resource utilization and poor scalability on large-scale graph analytics. Method: Intel proposes PIUMA, a programmable, integrated, unified-memory architecture built around co-packaged silicon-photonic interconnects. It combines many multi-threaded cores, fine-grained memory and network accesses, a globally shared address space, and offload engines, and it extends the on-chip mesh protocol onto the optical fabric so that all chips behave as one large “virtual die” even as the system scales to thousands of sockets. The architecture and its software stack were co-designed using simulation tools and FPGA emulation; a 316 mm² prototype die was taped out in 7 nm FinFET CMOS and assembled into a 16-node system. Results: Performance estimations project that a PIUMA node will outperform a conventional compute node by 10× to 100× on graph workloads, and that performance continues to scale across multiple nodes. The prototype silicon has powered on successfully, demonstrating key aspects of the architecture, some of which the authors say will be incorporated into future Intel products.

📝 Abstract

High-performance, large-scale graph analytics is essential for the timely analysis of relationships in big data sets. Conventional processor architectures suffer from inefficient resource usage and poor scaling on those workloads. To enable efficient and scalable graph analysis, Intel developed the Programmable Integrated Unified Memory Architecture (PIUMA) as part of the DARPA Hierarchical Identify Verify Exploit (HIVE) program. PIUMA consists of many multi-threaded cores, fine-grained memory and network accesses, a globally shared address space, powerful offload engines, and a tightly integrated optical interconnection network. By utilizing co-packaged silicon photonics and extending the on-chip mesh protocol directly to the optical fabric, all PIUMA chips in a system are glued together into a large virtual die, which allows for extremely low socket-to-socket latencies even as the system scales to thousands of sockets. Performance estimations project that a PIUMA node will outperform a conventional compute node by one to two orders of magnitude. Furthermore, PIUMA continues to scale across multiple nodes, which is a challenge in conventional multi-node setups. This paper presents the PIUMA architecture and documents our experience in designing and building a prototype chip and its bring-up process. We summarize the methodology for our co-design of the architecture together with the software stack using simulation tools and FPGA emulation. These tools provided early performance estimations of realistic applications and allowed us to implement many optimizations across the hardware, compilers, libraries, and applications. We built the PIUMA chip as a 316 mm² 7 nm FinFET CMOS die and constructed a 16-node system. PIUMA silicon has successfully powered on, demonstrating key aspects of the architecture, some of which will be incorporated into future Intel products.

Problem

Research questions and friction points this paper is trying to address.

Enables efficient and scalable large-scale graph analytics.

Addresses inefficiencies in conventional processor architectures.

Integrates optical silicon photonics for low-latency multi-node scaling.
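
To make the "inefficient resource usage" point concrete, the sketch below estimates how much of each fetched cache line is actually used when a graph kernel gathers values at scattered indices. It is only an illustration, not taken from the paper: the function name `cache_line_utilization`, the 8-byte element size, and the 64-byte cache line are assumptions chosen for the example.

```python
def cache_line_utilization(indices, elem_bytes=8, line_bytes=64):
    """Return useful bytes / fetched bytes for a gather over `indices`.

    Assumes (hypothetically) that elements are `elem_bytes` wide, stored
    contiguously from address 0, and that memory is fetched in whole
    cache lines of `line_bytes`. Sizes are illustrative, not from the paper.
    """
    unique = set(indices)
    # Each element i lives in cache line (i * elem_bytes) // line_bytes.
    lines = {(i * elem_bytes) // line_bytes for i in unique}
    useful = len(unique) * elem_bytes
    fetched = len(lines) * line_bytes
    return useful / fetched


# Dense, sequential access uses every byte that is fetched.
dense = cache_line_utilization(range(64))

# Scattered access (typical of following graph edges) touches one
# element per cache line, so most fetched bytes are wasted.
sparse = cache_line_utilization([0, 100, 200, 300])

print(dense, sparse)
```

Under these assumptions the scattered gather uses one eighth of the bytes it fetches, which is the kind of waste that fine-grained memory accesses are meant to avoid.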

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated optical silicon photonics for low latency

Multi-threaded cores with fine-grained memory access

Co-design methodology using simulation and FPGA emulation
