TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty of running mainstream low-bit quantization formats (e.g., AWQ) for large language model (LLM) inference on existing client NPUs, whose software stacks are often proprietary and expose limited low-level control. Focusing on the AMD XDNA2 NPU, the authors propose TileFuse, a hardware-aware mixed-precision kernel library that natively supports common quantized formats such as W4A16 and W8A16 without requiring model restructuring. It does so through an interleaved pre-tiled weight layout, fusion of unpacking and dequantization with GEMM/GEMV computation, and dataflow and microkernel designs optimized for the AIE array. Kernel-level experiments show performance improvements of up to 121.6% for GEMM and up to 281% for GEMV over full-precision baselines, and more than 2× performance and energy-efficiency gains over integrated GPU baselines on GEMM. In end-to-end LLM experiments on Ryzen AI laptops, the approach achieves up to 2.0× lower prefill latency with more than 64.6% lower energy consumption.
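The reported percentages can be restated as multiplicative factors for easier comparison. The short sketch below assumes that "p% improvement" means throughput of (1 + p/100) times the baseline and that "p% lower energy" means (1 - p/100) of the baseline energy; the helper names are illustrative and do not come from the paper.

```python
def speedup_from_improvement(pct):
    """Convert a p% performance improvement into a speedup factor.

    Assumption: the improvement is measured relative to baseline throughput.
    """
    return 1.0 + pct / 100.0


def remaining_fraction(pct_reduction):
    """Fraction of the baseline that remains after a p% reduction."""
    return 1.0 - pct_reduction / 100.0


if __name__ == "__main__":
    for name, pct in [("GEMM", 121.6), ("GEMV", 281.0)]:
        print(f"{name}: {pct}% improvement = {speedup_from_improvement(pct):.3f}x")
    frac = remaining_fraction(64.6)
    print(f"Energy: {frac:.3f} of baseline, i.e. {1.0 / frac:.2f}x less energy")
```

Under these assumptions, the GEMM and GEMV figures correspond to speedups of roughly 2.2× and 3.8×, and the energy figure corresponds to using about 35% of the baseline energy.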

📝 Abstract

With the growing demand for on-device LLM inference, edge SoCs increasingly integrate NPUs to improve performance and energy efficiency under tight power and thermal budgets. However, practical LLM deployment on current client NPUs remains difficult: widely used quantization formats such as AWQ do not map cleanly onto many existing NPU software stacks, which are often proprietary and expose limited low-level control. In this work, we present TileFuse, a close-to-metal mixed-precision kernel library for AMD XDNA2 NPUs that targets transformer linear layers in quantized LLM inference. TileFuse brings practical low-bit formats such as AWQ-style W4A16 and W8A16 directly onto XDNA2, rather than forcing the model to be reshaped around an NPU-specific quantization scheme. TileFuse co-designs weight layout, metadata placement, mixed-precision microkernels, and array-level dataflow. Specifically, it fuses unpacking, dequantization, and GEMM/GEMV execution into a single kernel flow, introduces an interleaved pre-tiling layout that supports GEMM dimensions up to 32K, and redesigns GEMV dataflow to utilize the full 4x8 AIE array. Across kernel-level evaluations, TileFuse improves performance by up to 121.6% for GEMM and 281% for GEMV over full-precision baselines, while delivering more than 2x performance and energy-efficiency gains over strong iGPU baselines on GEMM. In end-to-end LLM experiments on Ryzen AI laptops, TileFuse achieves up to 2.0x lower prefilling latency with more than 64.6% lower energy consumption. Together, these results show that XDNA2 is a practical target for AWQ-style edge LLM inference and that native NPU support for off-the-shelf quantization can make NPUs substantially more usable in real client deployments.
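To make the idea of fusing unpacking, dequantization, and GEMV concrete, here is a minimal pure-Python sketch of a generic W4A16-style scheme. It is not TileFuse's kernel or weight layout: the group size, the low-nibble-first packing order, the asymmetric integer zero-point, and all function names are assumptions chosen for illustration, and Python floats stand in for 16-bit activations.

```python
GROUP = 4  # assumed toy group size; real checkpoints typically use larger groups


def quantize_row(row, group=GROUP):
    """Group-wise asymmetric 4-bit quantization: w ~ scale * (q - zero)."""
    codes, scales, zeros = [], [], []
    for g in range(0, len(row), group):
        chunk = row[g:g + group]
        # Extend the range to include 0 so the zero-point fits in 4 bits.
        lo, hi = min(min(chunk), 0.0), max(max(chunk), 0.0)
        scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for an all-zero group
        zero = min(15, max(0, round(-lo / scale)))
        scales.append(scale)
        zeros.append(zero)
        codes.extend(min(15, max(0, round(w / scale) + zero)) for w in chunk)
    return codes, scales, zeros


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first (assumed order)."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def quantize_matrix(W, group=GROUP):
    """Quantize and pack every row of W; returns (packed, scales, zeros)."""
    packed, scales, zeros = [], [], []
    for row in W:
        c, s, z = quantize_row(row, group)
        packed.append(pack_nibbles(c))
        scales.append(s)
        zeros.append(z)
    return packed, scales, zeros


def fused_gemv(packed, scales, zeros, x, group=GROUP):
    """Compute y = W x, unpacking and dequantizing each weight inside the
    accumulation loop so that no full-precision copy of W is materialized."""
    y = []
    for p_row, s_row, z_row in zip(packed, scales, zeros):
        acc = 0.0
        for k, xk in enumerate(x):
            byte = p_row[k // 2]
            q = (byte >> 4) if k % 2 else (byte & 0x0F)  # unpack one nibble
            g = k // group
            acc += s_row[g] * (q - z_row[g]) * xk        # dequantize and MAC
        y.append(acc)
    return y


if __name__ == "__main__":
    W = [[0.5, -1.0, 0.25, 0.75, 2.0, -2.0, 1.0, 0.0],
         [0.1, 0.2, -0.3, 0.4, -1.5, 1.5, 0.5, -0.5]]
    x = [1.0, -0.5, 0.25, 2.0, 0.5, 1.0, -1.0, 0.75]
    packed, scales, zeros = quantize_matrix(W)
    print("fused    :", fused_gemv(packed, scales, zeros, x))
    print("reference:", [sum(w * v for w, v in zip(row, x)) for row in W])
```

The point of the sketch is the loop structure: weights stay packed at half a byte each until the moment they are multiplied, which is the memory-traffic saving that a fused kernel aims for. A hardware kernel would do the same work on vector tiles rather than scalars.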

Problem

Research questions and friction points this paper is trying to address.

quantized LLM inference

NPU software stack

AWQ quantization

edge deployment

mixed-precision

Innovation

Methods, ideas, or system contributions that make the work stand out.

mixed-precision kernels

quantized LLM inference

NPU co-design