AI Summary
This work addresses the conflict between conventional page-granularity interleaved memory layouts and the locality demands of GEMM operations on chiplet-based GPUs, which incurs substantial remote HBM access overhead. To resolve this, the authors propose Chiplet-Contiguous Layout, a hardware- and OS-agnostic global memory organization that stores each chiplet's local data contiguously, thereby enabling locality-aware scheduling while remaining compatible with standard 4KB page management. This approach is the first to unify data locality optimization for GEMM with page-granular memory management in both large language model inference and training. Experiments demonstrate that, compared to 4KB interleaved layouts, the proposed method reduces remote HBM traffic by 24.7× and 19.2× on Qwen 3 30B and Llama 3.1 70B, respectively, and further achieves 4.1× and 2.1× reductions over coarse-grained locality-aware placements.
Abstract
Multi-chiplet GPUs scale compute throughput and high-bandwidth memory (HBM) capacity, but their non-uniform memory system makes locality between chiplets and their data critical to the GPU's performance and energy efficiency. Locality-aware scheduling and data placement identify which data should reside near each chiplet. However, in general matrix multiplication (GEMM), locality-aware data placement often becomes incompatible with a fixed page-granularity data interleaving, since the optimal granularity for mapping data across chiplets varies widely across workloads. We propose Chiplet-Contiguous Layout, a global memory layout that stores chiplet-local data contiguously. Chiplet-Contiguous Layout enables locality-aware placement compatible with page-granularity placement across diverse large language model (LLM) GEMM shapes, without changes to the operating system or hardware. On representative LLM inference and training GEMMs from Qwen 3 30B and Llama 3.1 70B, Chiplet-Contiguous Layout on average reduces remote HBM traffic by 24.7x on Qwen and 19.2x on Llama over 4KB interleaving, and by 4.1x and 2.1x over coarse locality-aware placement.
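The tension between fixed 4KB interleaving and GEMM locality can be illustrated with a toy model. The sketch below is not the paper's implementation, since the abstract does not specify the actual address mapping. It assumes round-robin 4KB page interleaving across chiplets, a matrix partitioned into equal column tiles with one tile per chiplet, and a hypothetical `chiplet_contiguous_offset` function that packs each tile contiguously and spreads it over the pages its owner already holds. All function names and parameters are illustrative.

```python
PAGE = 4096  # bytes per page, matching the 4KB interleaving baseline


def home_chiplet(offset, num_chiplets):
    """Chiplet whose HBM holds this byte under round-robin page interleaving."""
    return (offset // PAGE) % num_chiplets


def row_major_offset(r, c, cols, elem):
    """Byte offset of element (r, c) in a conventional row-major matrix."""
    return (r * cols + c) * elem


def chiplet_contiguous_offset(r, c, cols, elem, num_chiplets):
    """Hypothetical mapping: pack each chiplet's column tile contiguously,
    then place the k-th page of that packed tile on the k-th page the
    owning chiplet receives under round-robin interleaving."""
    tile_cols = cols // num_chiplets          # assumes cols % num_chiplets == 0
    owner = c // tile_cols
    local = (r * tile_cols + c % tile_cols) * elem  # offset inside packed tile
    local_page, in_page = divmod(local, PAGE)
    return (local_page * num_chiplets + owner) * PAGE + in_page


def remote_fraction(offset_fn, rows, cols, num_chiplets):
    """Fraction of elements whose home chiplet differs from the chiplet
    that owns their column tile (and therefore reads them)."""
    tile_cols = cols // num_chiplets
    remote = 0
    for r in range(rows):
        for c in range(cols):
            if home_chiplet(offset_fn(r, c), num_chiplets) != c // tile_cols:
                remote += 1
    return remote / (rows * cols)


if __name__ == "__main__":
    rows, cols, elem, chiplets = 64, 4096, 2, 4  # toy shape, 2-byte elements
    baseline = remote_fraction(
        lambda r, c: row_major_offset(r, c, cols, elem), rows, cols, chiplets)
    packed = remote_fraction(
        lambda r, c: chiplet_contiguous_offset(r, c, cols, elem, chiplets),
        rows, cols, chiplets)
    print(f"row-major remote fraction:          {baseline:.2f}")
    print(f"chiplet-contiguous remote fraction: {packed:.2f}")
```

For this toy shape, each chiplet's slice of a row is 2KB, so every 4KB page mixes data from two chiplets and the row-major layout leaves 75% of elements remote, while the packed layout leaves none. The numbers illustrate the mechanism only and say nothing about the traffic reductions reported for Qwen 3 30B or Llama 3.1 70B.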