Making Locality-aware GEMM Compatible with Page-Granularity Placement on Chiplet GPUs

Date: 2026-06-10
Citations: 0
Influential: 0
AI Summary
This work addresses the conflict between conventional page-granularity interleaved memory layouts and the locality demands of GEMM operations on chiplet-based GPUs, a mismatch that incurs substantial remote HBM access overhead. To resolve it, the authors propose Chiplet-Contiguous Layout, a global memory organization that stores each chiplet's local data contiguously and requires no changes to the operating system or hardware. The layout enables locality-aware scheduling while remaining compatible with standard 4KB page management, reconciling GEMM data locality with page-granular memory management for both large language model inference and training. Experiments show that, compared to 4KB interleaved layouts, the proposed method reduces remote HBM traffic on average by 24.7× on Qwen 3 30B and 19.2× on Llama 3.1 70B, and by 4.1× and 2.1×, respectively, over coarse-grained locality-aware placement.
๐Ÿ“ Abstract
Multi-chiplet GPUs scale compute throughput and high-bandwidth memory (HBM) capacity, but their non-uniform memory system makes locality between chiplets and their data critical to the GPU's performance and energy efficiency. Locality-aware scheduling and data placement identify which data should reside near each chiplet. However, in general matrix multiplication (GEMM), locality-aware data placement often becomes incompatible with a fixed page-granularity data interleaving, since the optimal granularity for mapping data across chiplets varies widely across workloads. We propose Chiplet-Contiguous Layout, a global memory layout that stores chiplet-local data contiguously. Chiplet-Contiguous Layout enables locality-aware placement compatible with page-granularity placement across diverse large language model (LLM) GEMM shapes, without changes to the operating system or hardware. On representative LLM inference and training GEMMs from Qwen 3 30B and Llama 3.1 70B, Chiplet-Contiguous Layout on average reduces remote HBM traffic by 24.7x on Qwen and 19.2x on Llama over 4KB interleaving, and by 4.1x and 2.1x over coarse locality-aware placement.
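To make the granularity mismatch concrete, here is a minimal toy model, not the paper's implementation. It assumes a hypothetical schedule in which each chiplet reads one contiguous column slice of a matrix, and a placement policy that puts every 4KB page on the chiplet that reads most of it. All function names, the matrix shape, and the 2-byte element size are illustrative assumptions.

```python
from collections import Counter

PAGE_BYTES = 4096  # 4KB pages, as in the text


def owner_of(col, cols, chiplets):
    # Assumed schedule: chiplet k reads the k-th contiguous column slice.
    return col // (cols // chiplets)


def row_major_offset(row, col, rows, cols, chiplets, elem_bytes):
    # Conventional layout: a row interleaves every chiplet's slice.
    return (row * cols + col) * elem_bytes


def chiplet_contiguous_offset(row, col, rows, cols, chiplets, elem_bytes):
    # Illustrative layout: each chiplet's slice is one contiguous region.
    width = cols // chiplets
    region_bytes = rows * width * elem_bytes
    k = col // width
    return k * region_bytes + (row * width + col % width) * elem_bytes


def remote_fraction(offset_fn, rows, cols, chiplets, elem_bytes=2):
    # For each page, tally how many bytes each chiplet's schedule reads.
    per_page = {}
    for r in range(rows):
        for c in range(cols):
            offset = offset_fn(r, c, rows, cols, chiplets, elem_bytes)
            counts = per_page.setdefault(offset // PAGE_BYTES, Counter())
            counts[owner_of(c, cols, chiplets)] += elem_bytes
    # Place each page on its majority reader; every other byte is remote.
    total = rows * cols * elem_bytes
    local = sum(max(counts.values()) for counts in per_page.values())
    return (total - local) / total


if __name__ == "__main__":
    shape = dict(rows=64, cols=1024, chiplets=4)
    print(remote_fraction(row_major_offset, **shape))
    print(remote_fraction(chiplet_contiguous_offset, **shape))
```

In this toy configuration every row-major page holds equal shares of all four chiplets' slices, so even the best page-level placement leaves 75% of the bytes remote. In the contiguous layout each page belongs to a single chiplet and the remote fraction drops to zero. Real GEMM tilings and the paper's measured reductions depend on details this sketch does not model.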
Problem

Research questions and friction points this paper is trying to address.

Chiplet GPU
Locality-aware placement
Page granularity
GEMM
Memory layout
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chiplet GPU
Locality-aware placement
GEMM
Memory layout
Page granularity