Making Locality-aware GEMM Compatible with Page-Granularity Placement on Chiplet GPUs

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a conflict on chiplet-based GPUs: conventional fixed page-granularity interleaving does not match the locality demands of GEMM, so chiplets incur substantial remote HBM traffic. The authors propose Chiplet-Contiguous Layout, a global memory layout that stores each chiplet's local data contiguously, enabling locality-aware placement that stays compatible with page-granularity placement and requires no changes to the operating system or hardware. The approach covers diverse GEMM shapes from both large language model (LLM) inference and training. On representative GEMMs from Qwen 3 30B and Llama 3.1 70B, it reduces remote HBM traffic on average by 24.7× and 19.2×, respectively, compared with 4KB interleaving, and by 4.1× and 2.1× compared with coarse locality-aware placement.

📝 Abstract

Multi-chiplet GPUs scale compute throughput and high-bandwidth memory (HBM) capacity, but their non-uniform memory system makes locality between chiplets and their data critical to the GPU's performance and energy efficiency. Locality-aware scheduling and data placement identify which data should reside near each chiplet. However, in general matrix multiplication (GEMM), locality-aware data placement often becomes incompatible with a fixed page-granularity data interleaving, since the optimal granularity for mapping data across chiplets varies widely across workloads. We propose Chiplet-Contiguous Layout, a global memory layout that stores chiplet-local data contiguously. Chiplet-Contiguous Layout enables locality-aware placement compatible with page-granularity placement across diverse large language model (LLM) GEMM shapes, without changes to the operating system or hardware. On representative LLM inference and training GEMMs from Qwen 3 30B and Llama 3.1 70B, Chiplet-Contiguous Layout on average reduces remote HBM traffic by 24.7x on Qwen and 19.2x on Llama over 4KB interleaving, and by 4.1x and 2.1x over coarse locality-aware placement.
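To make the mismatch described in the abstract concrete, here is a minimal sketch that is not taken from the paper. It assumes a row-major matrix whose columns are split evenly across chiplets (so chiplet `k` consumes column block `k`), and compares two hypothetical placements: 4KB pages interleaved round-robin across chiplets, and a chiplet-contiguous ordering in which each chiplet's column block is stored back to back and every page is homed on the chiplet that owns the page's first byte. The function name, the layout labels, and the ownership rule are illustrative assumptions.

```python
PAGE_BYTES = 4096  # page size used by the 4KB interleaving baseline


def remote_fraction(rows, cols, elem_bytes, num_chiplets, layout):
    """Return the fraction of matrix elements whose page is homed on a
    chiplet other than the one that consumes them.

    Assumption (illustrative only): chiplet k consumes column block k,
    and cols is divisible by num_chiplets.
    """
    cols_per_chiplet = cols // num_chiplets
    block_bytes = rows * cols_per_chiplet * elem_bytes
    remote = 0
    for r in range(rows):
        for c in range(cols):
            owner = c // cols_per_chiplet  # chiplet that reads this element
            if layout == "row_major_interleaved":
                # Plain row-major addressing, pages dealt round-robin.
                offset = (r * cols + c) * elem_bytes
                home = (offset // PAGE_BYTES) % num_chiplets
            elif layout == "chiplet_contiguous":
                # Each chiplet's column block is stored contiguously.
                local_c = c % cols_per_chiplet
                offset = owner * block_bytes + (r * cols_per_chiplet + local_c) * elem_bytes
                # Page-granularity placement: the whole page goes to the
                # chiplet that owns the page's first byte.
                page_start = (offset // PAGE_BYTES) * PAGE_BYTES
                home = page_start // block_bytes
            else:
                raise ValueError(f"unknown layout: {layout}")
            if home != owner:
                remote += 1
    return remote / (rows * cols)


if __name__ == "__main__":
    for name in ("row_major_interleaved", "chiplet_contiguous"):
        print(name, remote_fraction(64, 1024, 2, 4, name))
```

With a 64 × 1024 fp16 matrix and four chiplets, one 4KB page holds two full rows, so under round-robin interleaving every page mixes data for all four chiplets and three quarters of the accesses are remote. In the contiguous ordering each chiplet's block is exactly eight pages, so every page can be homed with its consumer. When the block size is not a multiple of the page size, pages that straddle a block boundary still produce some remote accesses in this sketch.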

Problem

Research questions and friction points this paper is trying to address.

chiplet GPU

locality-aware placement

page-granularity

GEMM

memory layout

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chiplet GPU

Locality-aware placement

GEMM