WaferLLM: A Wafer-Scale LLM Inference System

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM inference systems are designed for GPU shared-memory architectures and thus fail to efficiently utilize wafer-scale AI accelerators, which are characterized by hundreds of thousands of cores, tens of gigabytes of distributed on-chip memory, and tens of petabytes per second of on-chip memory bandwidth. This work presents the first LLM inference system co-designed for wafer-scale architectures. The approach comprises three key innovations: (1) a novel PLMR device model (Parallelism, Latency, Memory, Routing) that captures the hardware's defining characteristics; (2) a wafer-scale distributed parallel paradigm for LLMs; and (3) mesh-topology-aware, scalable MeshGEMM and MeshGEMV operators, coupled with hierarchical on-chip memory mapping and scheduling optimizations. Experimental evaluation demonstrates that the system achieves 200× better wafer-scale accelerator utilization, attains GEMV throughput 606× higher than an advanced GPU, and delivers 22× higher GEMV energy efficiency. Furthermore, LLM decoding speed increases by 39×, and end-to-end energy efficiency improves by 1.7×.

📝 Abstract
Emerging AI accelerators increasingly adopt wafer-scale manufacturing technologies, integrating hundreds of thousands of AI cores in a mesh-based architecture with large distributed on-chip memory (tens of GB in total) and ultra-high on-chip memory bandwidth (tens of PB/s). However, current LLM inference systems, optimized for shared memory architectures like GPUs, fail to fully exploit these accelerators. We introduce WaferLLM, the first wafer-scale LLM inference system. WaferLLM is guided by a novel PLMR device model that captures the unique hardware characteristics of wafer-scale architectures. Leveraging this model, WaferLLM pioneers wafer-scale LLM parallelism, optimizing the utilization of hundreds of thousands of on-chip cores. It also introduces MeshGEMM and MeshGEMV, the first GEMM and GEMV implementations designed to scale effectively on wafer-scale accelerators. Evaluations show that WaferLLM achieves 200× better wafer-scale accelerator utilization than state-of-the-art systems. On a commodity wafer-scale accelerator, WaferLLM delivers 606× faster and 22× more energy-efficient GEMV compared to an advanced GPU. For LLMs, WaferLLM enables 39× faster decoding with 1.7× better energy efficiency. We anticipate these numbers will grow significantly as wafer-scale AI models, software, and hardware continue to mature.
Problem

Research questions and friction points this paper is trying to address.

Existing inference systems target GPU-style shared-memory architectures and cannot parallelize LLMs across a wafer-scale mesh
Hundreds of thousands of on-chip AI cores remain underutilized by current systems
Computational throughput and energy efficiency fall far short of the hardware's potential
Innovation

Methods, ideas, or system contributions that make the work stand out.

Wafer-scale LLM parallelism guided by the novel PLMR device model
MeshGEMM and MeshGEMV, the first GEMM/GEMV operators designed to scale on wafer-scale mesh architectures
Hierarchical on-chip memory mapping and scheduling optimizations
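The core idea behind a mesh-distributed GEMV, partitioning the matrix into tiles owned by individual cores and reducing partial sums along mesh rows, can be illustrated with a toy NumPy sketch. This is an illustrative simulation only, not WaferLLM's actual MeshGEMV; tile shapes, mesh size, and the reduction scheme here are simplifying assumptions:

```python
import numpy as np

# Toy simulation of a GEMV (y = A @ x) tiled over a P x P mesh of cores.
# Each "core" (i, j) holds one tile of A and the matching slice of x,
# computes a local partial product, and partial sums are then reduced
# along each mesh row. Illustrative sketch, not the paper's operator.

P = 4                       # mesh is P x P cores (assumed size)
m, n = 8, 8                 # matrix dims, divisible by P for simplicity
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

rt, ct = m // P, n // P     # tile height / width owned by each core

# Phase 1: every core computes its local partial GEMV on its tile.
partials = [[A[i*rt:(i+1)*rt, j*ct:(j+1)*ct] @ x[j*ct:(j+1)*ct]
             for j in range(P)] for i in range(P)]

# Phase 2: reduce partial sums along each mesh row. On real hardware
# this would be neighbor-to-neighbor communication over mesh links,
# which is why topology-aware operator design matters at wafer scale.
y = np.concatenate([sum(partials[i][j] for j in range(P))
                    for i in range(P)])

assert np.allclose(y, A @ x)
```

The sketch shows why per-core memory and routing constraints (the M and R in PLMR) shape the design: each core only ever touches its own tile, and all cross-core traffic is the row-wise partial-sum reduction.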