Striking the Balance: GEMM Performance Optimization Across Generations of Ryzen AI NPUs

📅 2025-12-15
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the performance bottlenecks of edge AI inference on AMD Ryzen AI platforms by systematically optimizing General Matrix Multiplication (GEMM) for two generations of NPUs, XDNA and XDNA2. We propose the first cross-generation unified GEMM optimization methodology, deeply integrating hardware-specific characteristics of both architectures through co-designed instruction scheduling, data reuse strategies, and memory bandwidth utilization. Our approach employs custom ISA assembly kernels, hierarchical tiling, double-buffered pipelining, quantization-aware memory access patterns, and mixed-precision bf16/int8 computation. Experimental results demonstrate peak GEMM throughput of 6.76 TOPS (int8) and 3.14 TOPS (bf16) on XDNA, and 38.05 TOPS (int8) and 14.71 TOPS (bf16) on XDNA2, setting new state-of-the-art measured performance records for the Ryzen AI platform. These results significantly overcome architectural adaptation challenges and system-level performance limitations in NPU-accelerated edge inference.
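The hierarchical-tiling idea mentioned in the summary can be illustrated with a generic, pure-Python reference. This is a sketch of the general blocking technique, not the paper's ISA-level kernels; the tile sizes `TM`/`TN`/`TK` are placeholders, where a real NPU kernel would pick them to match the sizes of the on-chip memory tiles and vector registers.

```python
# Illustrative sketch (not the paper's kernels): hierarchical tiling for GEMM.
# Tile sizes TM/TN/TK are hypothetical stand-ins for NPU memory-tile dimensions.

def tiled_gemm(A, B, M, N, K, TM=4, TN=4, TK=4):
    """C = A @ B computed block by block, so each tile of A and B is
    reused across many multiply-accumulates before being evicted."""
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, TM):              # outer tiles map to local memory
        for j0 in range(0, N, TN):
            for k0 in range(0, K, TK):      # inner K-loop reuses the accumulator
                for i in range(i0, min(i0 + TM, M)):
                    for j in range(j0, min(j0 + TN, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + TK, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

# Tiny check against the naive triple-loop definition
M = N = K = 6
A = [[(i + k) % 5 for k in range(K)] for i in range(M)]
B = [[(k * j + 1) % 7 for j in range(N)] for k in range(K)]
ref = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)] for i in range(M)]
assert tiled_gemm(A, B, M, N, K) == ref
```

The point of the blocking is data reuse: each `TM x TK` tile of A and `TK x TN` tile of B stays resident while contributing to a full output tile, which is the software analogue of keeping operands in an NPU's local memory instead of refetching them from DRAM.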

๐Ÿ“ Abstract
The high computational and memory demands of modern deep learning (DL) workloads have led to the development of specialized hardware devices from cloud to edge, such as AMD's Ryzen AI XDNA NPUs. Optimizing general matrix multiplication (GEMM) algorithms for these architectures is critical for improving DL workload performance. To this end, this paper presents a common systematic methodology to optimize GEMM workloads across the two current NPU generations, namely XDNA and XDNA2. Our implementations exploit the unique architectural features of AMD's NPUs and address key performance bottlenecks at the system level. End-to-end performance evaluation across various GEMM sizes demonstrates state-of-the-art throughput of up to 6.76 TOPS (XDNA) and 38.05 TOPS (XDNA2) for 8-bit integer (int8) precision. Similarly, for brain floating-point (bf16) precision, our GEMM implementations attain up to 3.14 TOPS (XDNA) and 14.71 TOPS (XDNA2). This work provides significant insights into key performance aspects of optimizing GEMM workloads on Ryzen AI NPUs.
Problem

Research questions and friction points this paper is trying to address.

Optimizing GEMM algorithms for AMD Ryzen AI NPUs
Addressing performance bottlenecks in deep learning workloads
Achieving high throughput across different NPU generations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic methodology for GEMM optimization across NPU generations
Exploiting unique architectural features to address performance bottlenecks
Achieving state-of-the-art throughput for int8 and bf16 precision
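One of the system-level techniques both the summary and the abstract allude to is double-buffered ("ping-pong") pipelining, where the transfer of the next data tile overlaps compute on the current one. The sketch below is a hypothetical stand-in, not the paper's implementation: `load_tile` plays the role of a DMA transfer and `compute_tile` the role of the NPU core's work on a tile.

```python
# Illustrative sketch (not the paper's implementation): double-buffered
# pipelining. A worker thread stands in for the DMA engine, so loading
# tile k+1 overlaps compute on tile k. Tile contents here are placeholders.
from concurrent.futures import ThreadPoolExecutor

def load_tile(k):          # stand-in for a DMA transfer into local memory
    return [k] * 4

def compute_tile(tile):    # stand-in for the core's multiply-accumulate work
    return sum(tile)

def pipelined(num_tiles):
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(load_tile, 0)        # prefetch into buffer A
        for k in range(num_tiles):
            tile = pending.result()               # wait for the current tile
            if k + 1 < num_tiles:
                pending = dma.submit(load_tile, k + 1)  # fill buffer B
            results.append(compute_tile(tile))    # overlaps the next load
    return results

assert pipelined(3) == [0, 4, 8]
```

With two buffers alternating roles, steady-state throughput is limited by the slower of load and compute rather than their sum, which is why this pattern matters for bandwidth-bound GEMM tiles.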
Endri Taka, The University of Texas at Austin
Andre Roesti, Advanced Micro Devices, Inc.
Joseph Melber, Advanced Micro Devices, Inc.
Pranathi Vasireddy, Advanced Micro Devices, Inc.
Kristof Denolf, Principal Engineer, Xilinx (Cost Efficient Vision Processing)
Diana Marculescu, The University of Texas at Austin (Efficient Deep Learning, Energy-aware Computing, Low Power Design, Design Automation, Dark Silicon)