AI Summary
This work addresses the performance bottlenecks of edge AI inference on AMD Ryzen AI platforms by systematically optimizing General Matrix Multiplication (GEMM) for two generations of NPUs: XDNA and XDNA2. We propose the first cross-generation unified GEMM optimization methodology, deeply integrating the hardware-specific characteristics of both architectures through co-designed instruction scheduling, data reuse strategies, and memory bandwidth utilization. Our approach employs custom ISA assembly kernels, hierarchical tiling, double-buffered pipelining, quantization-aware memory access patterns, and mixed-precision bf16/int8 computation. Experimental results demonstrate peak GEMM throughput of 6.76 TOPS (int8) and 3.14 TOPS (bf16) on XDNA, and 38.05 TOPS (int8) and 14.71 TOPS (bf16) on XDNA2, setting new state-of-the-art measured performance records for the Ryzen AI platform. These results overcome key architectural adaptation challenges and system-level performance limitations in NPU-accelerated edge inference.
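To make the hierarchical-tiling idea concrete, the sketch below shows a plain-C int8 GEMM with int32 accumulation, blocked over M, N, and K. The tile sizes `TM`/`TN`/`TK` are hypothetical placeholders for an NPU core's local-memory capacity, and the loop structure only illustrates the blocking pattern; the paper's actual kernels are hand-written ISA assembly with DMA-driven double buffering, which this host-side sketch does not model.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical tile sizes standing in for per-core local-memory capacity. */
enum { TM = 4, TN = 4, TK = 8 };

/* Tiled int8 GEMM: C[m][n] = A[m][k] * B[k][n], accumulating in int32
 * so that int8 products never overflow. Outer loops walk tiles (which an
 * NPU would map to cores and DMA transfers); inner loops compute one tile. */
static void tiled_gemm_int8(int m, int n, int k,
                            const int8_t *a, const int8_t *b, int32_t *c)
{
    memset(c, 0, (size_t)m * (size_t)n * sizeof *c);
    for (int i0 = 0; i0 < m; i0 += TM)          /* tile rows of A/C      */
        for (int j0 = 0; j0 < n; j0 += TN)      /* tile columns of B/C   */
            for (int k0 = 0; k0 < k; k0 += TK)  /* stream K-dim tiles    */
                for (int i = i0; i < i0 + TM && i < m; i++)
                    for (int j = j0; j < j0 + TN && j < n; j++) {
                        int32_t acc = c[i * n + j];
                        for (int kk = k0; kk < k0 + TK && kk < k; kk++)
                            acc += (int32_t)a[i * k + kk]
                                 * (int32_t)b[kk * n + j];
                        c[i * n + j] = acc;
                    }
}
```

In a real NPU kernel, each (i0, j0) tile would be assigned to a compute core and the K-dimension tiles would be double-buffered, so the DMA fetch of tile k0+TK overlaps with the multiply-accumulate of tile k0.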
Abstract
The high computational and memory demands of modern deep learning (DL) workloads have driven the development of specialized hardware from cloud to edge, such as AMD's Ryzen AI XDNA NPUs. Optimizing general matrix multiplication (GEMM) algorithms for these architectures is critical to improving DL workload performance. To this end, this paper presents a common, systematic methodology for optimizing GEMM workloads across the two current NPU generations, XDNA and XDNA2. Our implementations exploit the unique architectural features of AMD's NPUs and address key performance bottlenecks at the system level. End-to-end evaluation across a range of GEMM sizes demonstrates state-of-the-art throughput of up to 6.76 TOPS (XDNA) and 38.05 TOPS (XDNA2) for 8-bit integer (int8) precision. Similarly, for brain floating-point (bf16) precision, our GEMM implementations attain up to 3.14 TOPS (XDNA) and 14.71 TOPS (XDNA2). This work provides significant insights into the key performance aspects of optimizing GEMM workloads on Ryzen AI NPUs.