RPU -- A Reasoning Processing Unit

πŸ“… 2026-02-20
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the inefficiency and low utilization of large language model (LLM) inference under the β€œmemory wall,” particularly in scenarios involving long outputs, low computational intensity, and stringent latency constraints. To overcome these limitations, the authors propose RPU, a chiplet-based architecture optimized for inference, which innovatively integrates capacity-optimized HBM (HBM-CO), a bandwidth-first scalable chiplet design, and a microarchitecture that decouples computation, memory, and communication. This co-design significantly enhances memory bandwidth utilization. Evaluated on the Llama3-405B model, RPU achieves up to a 45.3Γ— reduction in latency and an 18.6Γ— improvement in throughput compared to an H100 system under the same thermal design power (TDP) envelope.

πŸ“ Abstract
Large language model (LLM) inference performance is increasingly bottlenecked by the memory wall. While GPUs continue to scale raw compute throughput, they struggle to deliver scalable performance for memory-bandwidth-bound workloads. This challenge is amplified by emerging reasoning LLM applications, where long output sequences, low arithmetic intensity, and tight latency constraints demand significantly higher memory bandwidth. As a result, system utilization drops and energy per inference rises, highlighting the need for a system architecture optimized for scalable memory bandwidth. To address these challenges, we present the Reasoning Processing Unit (RPU), a chiplet-based architecture designed for the modern memory wall. RPU introduces: (1) a Capacity-Optimized High-Bandwidth Memory (HBM-CO) that trades capacity for lower energy and cost; (2) a scalable chiplet architecture featuring a bandwidth-first power and area provisioning design; and (3) a decoupled microarchitecture that separates memory, compute, and communication pipelines to sustain high bandwidth utilization. Simulation results show that RPU achieves up to 45.3× lower latency and 18.6× higher throughput over an H100 system at iso-TDP on Llama3-405B.
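The abstract's point about low arithmetic intensity can be made concrete with a back-of-envelope roofline calculation for the decode phase. This sketch is illustrative only and is not from the paper; the H100 peak figures are assumed from public spec sheets:

```python
# Roofline back-of-envelope: why LLM decode is memory-bandwidth-bound.
# Assumed H100 SXM peak numbers (public spec, not from the paper):
peak_flops = 989e12  # ~989 TFLOP/s dense FP16
peak_bw = 3.35e12    # ~3.35 TB/s HBM3 bandwidth

# In batch-1 decode, each weight matrix performs a matrix-vector multiply:
# 2 FLOPs per parameter, while every FP16 weight (2 bytes) must be
# streamed from HBM with no reuse across tokens.
flops_per_param = 2
bytes_per_param = 2
arithmetic_intensity = flops_per_param / bytes_per_param  # 1 FLOP/byte

# Ridge point: intensity below this value means bandwidth limits performance.
ridge_point = peak_flops / peak_bw  # ~295 FLOP/byte
memory_bound = arithmetic_intensity < ridge_point

# Attainable compute is capped by bandwidth * intensity on the bandwidth side.
attained_flops = min(peak_flops, peak_bw * arithmetic_intensity)
utilization = attained_flops / peak_flops
print(f"ridge point: {ridge_point:.0f} FLOP/byte, decode: {arithmetic_intensity:.0f} FLOP/byte")
print(f"compute utilization ceiling: {utilization:.2%}")
```

With an intensity of ~1 FLOP/byte against a ridge point near 300, the compute utilization ceiling is well under 1%, which is the gap a bandwidth-first design like RPU targets.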
Problem

Research questions and friction points this paper is trying to address.

memory wall
large language model
inference performance
memory bandwidth
reasoning LLM
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reasoning Processing Unit
Memory Wall
Chiplet Architecture
High-Bandwidth Memory
Decoupled Microarchitecture
πŸ”Ž Similar Papers
No similar papers found.