Striking the Balance: GEMM Performance Optimization Across Generations of Ryzen AI NPUs

📅 2025-12-15
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the performance bottlenecks of edge AI inference on AMD Ryzen AI platforms by systematically optimizing General Matrix Multiplication (GEMM) for two generations of NPUs, XDNA and XDNA2. We propose the first cross-generation unified GEMM optimization methodology, deeply integrating hardware-specific characteristics of both architectures through co-designed instruction scheduling, data reuse strategies, and memory bandwidth utilization. Our approach employs custom ISA assembly kernels, hierarchical tiling, double-buffered pipelining, quantization-aware memory access patterns, and mixed-precision bf16/int8 computation. Experimental results demonstrate peak GEMM throughput of 6.76 TOPS (int8) and 3.14 TOPS (bf16) on XDNA, and 38.05 TOPS (int8) and 14.71 TOPS (bf16) on XDNA2, setting new state-of-the-art measured performance records for the Ryzen AI platform. These results significantly overcome architectural adaptation challenges and system-level performance limitations in NPU-accelerated edge inference.
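The hierarchical-tiling idea mentioned in the summary can be illustrated with a generic, pure-Python reference. This is a sketch of the general blocking technique, not the paper's ISA-level kernels; the tile sizes `TM`/`TN`/`TK` are placeholders, where a real NPU kernel would pick them to match the sizes of the on-chip memory tiles and vector registers.

```python
# Illustrative sketch (not the paper's kernels): hierarchical tiling for GEMM.
# Tile sizes TM/TN/TK are hypothetical stand-ins for NPU memory-tile dimensions.

def tiled_gemm(A, B, M, N, K, TM=4, TN=4, TK=4):
    """C = A @ B computed block by block, so each tile of A and B is
    reused across many multiply-accumulates before being evicted."""
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, TM):              # outer tiles map to local memory
        for j0 in range(0, N, TN):
            for k0 in range(0, K, TK):      # inner K-loop reuses the accumulator
                for i in range(i0, min(i0 + TM, M)):
                    for j in range(j0, min(j0 + TN, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + TK, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

# Tiny check against the naive triple-loop definition
M = N = K = 6
A = [[(i + k) % 5 for k in range(K)] for i in range(M)]
B = [[(k * j + 1) % 7 for j in range(N)] for k in range(K)]
ref = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)] for i in range(M)]
assert tiled_gemm(A, B, M, N, K) == ref
```

The point of the blocking is data reuse: each `TM x TK` tile of A and `TK x TN` tile of B stays resident while contributing to a full output tile, which is the software analogue of keeping operands in an NPU's local memory instead of refetching them from DRAM.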

๐Ÿ“ Abstract
The high computational and memory demands of modern deep learning (DL) workloads have led to the development of specialized hardware devices from cloud to edge, such as AMD's Ryzen AI XDNA NPUs. Optimizing general matrix multiplication (GEMM) algorithms for these architectures is critical for improving DL workload performance. To this end, this paper presents a common systematic methodology to optimize GEMM workloads across the two current NPU generations, namely XDNA and XDNA2. Our implementations exploit the unique architectural features of AMD's NPUs and address key performance bottlenecks at the system level. End-to-end performance evaluation across various GEMM sizes demonstrates state-of-the-art throughput of up to 6.76 TOPS (XDNA) and 38.05 TOPS (XDNA2) for 8-bit integer (int8) precision. Similarly, for brain floating-point (bf16) precision, our GEMM implementations attain up to 3.14 TOPS (XDNA) and 14.71 TOPS (XDNA2). This work provides significant insights into key performance aspects of optimizing GEMM workloads on Ryzen AI NPUs.
Problem

Research questions and friction points this paper is trying to address.

Optimizing GEMM algorithms for AMD Ryzen AI NPUs
Addressing performance bottlenecks in deep learning workloads
Achieving high throughput across different NPU generations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic methodology for GEMM optimization across NPU generations
Exploiting unique architectural features to address performance bottlenecks
Achieving state-of-the-art throughput for int8 and bf16 precision
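One of the system-level techniques both the summary and the abstract allude to is double-buffered ("ping-pong") pipelining, where the transfer of the next data tile overlaps compute on the current one. The sketch below is a hypothetical stand-in, not the paper's implementation: `load_tile` plays the role of a DMA transfer and `compute_tile` the role of the NPU core's work on a tile.

```python
# Illustrative sketch (not the paper's implementation): double-buffered
# pipelining. A worker thread stands in for the DMA engine, so loading
# tile k+1 overlaps compute on tile k. Tile contents here are placeholders.
from concurrent.futures import ThreadPoolExecutor

def load_tile(k):          # stand-in for a DMA transfer into local memory
    return [k] * 4

def compute_tile(tile):    # stand-in for the core's multiply-accumulate work
    return sum(tile)

def pipelined(num_tiles):
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(load_tile, 0)        # prefetch into buffer A
        for k in range(num_tiles):
            tile = pending.result()               # wait for the current tile
            if k + 1 < num_tiles:
                pending = dma.submit(load_tile, k + 1)  # fill buffer B
            results.append(compute_tile(tile))    # overlaps the next load
    return results

assert pipelined(3) == [0, 4, 8]
```

With two buffers alternating roles, steady-state throughput is limited by the slower of load and compute rather than their sum, which is why this pattern matters for bandwidth-bound GEMM tiles.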
Endri Taka, The University of Texas at Austin
Andre Roesti, Advanced Micro Devices, Inc.
Joseph Melber, Advanced Micro Devices, Inc.
Pranathi Vasireddy, Advanced Micro Devices, Inc.
Kristof Denolf, Principal Engineer, Xilinx (Cost Efficient Vision Processing)
Diana Marculescu, The University of Texas at Austin (Efficient Deep Learning, Energy-aware Computing, Low Power Design, Design Automation, Dark Silicon)