🤖 AI Summary
In multi-chiplet GPUs, remote HBM accesses degrade the performance and energy efficiency of GEMM operations, and choosing the operand layout, CTA traversal order, and data placement that minimize them is nontrivial. This work proposes a fast, functional, tile-level locality simulator that models CTA scheduling, per-chiplet L2 caches, and local/remote HBM accesses to evaluate inter-chiplet traffic for full-size GEMMs in large language models. The study identifies CTA traversal order as a first-order, GEMM-dependent design knob for remote traffic. Using the simulator as feedback, an agentic AI finds that a 2D block-swizzle CTA traversal reduces remote traffic by up to 5.1× compared to the best 1D traversal under round-robin placement, while remote traffic varies by up to 90× across the design space for the same GEMM dimensions.
📝 Abstract
Multi-chiplet GPUs split memory into local and remote HBM regions across a silicon interposer, and reducing remote HBM traffic is crucial for their performance and energy efficiency. For general matrix multiplication (GEMM), the dominant operator in large language models (LLMs), the resulting inter-chiplet traffic depends strongly on kernel choices such as operand layout, cooperative thread array (CTA) traversal order, and data placement, and the strategy that minimizes remote accesses is nontrivial to find. We present a fast, functional, tile-level locality simulator that models CTA scheduling, per-chiplet L2 caches, and local/remote HBM accesses to evaluate a full-size LLM GEMM configuration. Across representative LLM GEMMs, the simulator shows that remote traffic varies by up to 90× across the design space for the same GEMM dimensions. Moreover, using the simulator as feedback, an agentic AI discovers that a 2D block-swizzle CTA traversal reduces remote traffic over the best 1D traversal by up to 5.1× under round-robin placement, identifying CTA traversal order as a first-order, GEMM-dependent design knob for inter-chiplet traffic.
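To make the idea of a tile-level locality model concrete, the following is a minimal Python sketch. It is not the authors' simulator. Every name in it (`row_major`, `block_swizzle`, `remote_tile_fetches`) is hypothetical, and the modeling choices are assumptions made for illustration: CTAs are handed to chiplets in contiguous chunks of the traversal order, operand tiles are placed round-robin across chiplet HBMs by linear tile index, and each chiplet's L2 is an LRU cache measured in tiles.

```python
from collections import OrderedDict

def row_major(mt, nt):
    """1D traversal: sweep the mt x nt grid of output tiles row by row."""
    return [(m, n) for m in range(mt) for n in range(nt)]

def block_swizzle(mt, nt, b):
    """2D traversal: visit b x b blocks of output tiles, row-major inside each block."""
    order = []
    for bm in range(0, mt, b):
        for bn in range(0, nt, b):
            for m in range(bm, min(bm + b, mt)):
                for n in range(bn, min(bn + b, nt)):
                    order.append((m, n))
    return order

def remote_tile_fetches(order, nt, kt, chiplets, l2_tiles):
    """Count L2 misses that must be served from another chiplet's HBM."""
    caches = [OrderedDict() for _ in range(chiplets)]  # one LRU L2 per chiplet
    remote = 0
    for pos, (m, n) in enumerate(order):
        # Assumption: contiguous chunks of the traversal go to the same chiplet.
        chip = pos * chiplets // len(order)
        cache = caches[chip]
        for k in range(kt):
            # Each CTA reads A tile (m, k) and B tile (k, n) at every k step.
            for name, idx in (("A", m * kt + k), ("B", k * nt + n)):
                key = (name, idx)
                if key in cache:
                    cache.move_to_end(key)  # L2 hit: refresh LRU position
                    continue
                # Assumption: tiles are placed round-robin by linear tile index.
                if idx % chiplets != chip:
                    remote += 1
                cache[key] = True
                if len(cache) > l2_tiles:
                    cache.popitem(last=False)  # evict least recently used tile
    return remote

if __name__ == "__main__":
    mt = nt = 4
    one_d = remote_tile_fetches(row_major(mt, nt), nt, kt=1, chiplets=4, l2_tiles=16)
    two_d = remote_tile_fetches(block_swizzle(mt, nt, 2), nt, kt=1, chiplets=4, l2_tiles=16)
    print(one_d, two_d)
```

In this toy 4×4 tile grid with four chiplets, the row-major order incurs 12 remote tile fetches and the 2×2 block order incurs 10, because a square block of CTAs shares both A rows and B columns. These counts only illustrate the mechanism and say nothing about the 5.1× and 90× figures reported above, which come from the authors' simulator on full-size LLM GEMMs.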