Hexagon-MLIR: An AI Compilation Stack For Qualcomm's Neural Processing Units (NPUs)

πŸ“… 2026-02-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the memory bandwidth bottleneck and limited flexibility of conventional library-based approaches in deploying AI models on Qualcomm’s Hexagon NPU by introducing the first open-source MLIR compilation stack tailored for this architecture. The proposed stack enables automatic compilation of optimized NPU binaries from both Triton kernels and PyTorch 2.0 subgraphs. Through a structured compilation pipeline, mega-kernel fusion, and explicit optimization of data locality in tightly coupled memory (TCM), it significantly enhances data reuse and alleviates off-chip bandwidth pressure. Notably, this study presents the first end-to-end compilation framework from Triton to Hexagon NPU, establishing a high-performance, scalable, and open-source deployment pathway for edge AI applications.

πŸ“ Abstract
In this paper, we present Hexagon-MLIR, an open-source compilation stack that targets the Qualcomm Hexagon Neural Processing Unit (NPU) and provides unified support for lowering Triton kernels and PyTorch models. Built on the MLIR framework, our compiler applies a structured sequence of passes that exploit NPU architectural features to accelerate AI workloads. It enables faster deployment of new Triton kernels (hand-written, or subgraphs extracted from PyTorch 2.0) on our target by automating compilation from kernel to binary. By ingesting Triton kernels, we generate mega-kernels that maximize data locality in the NPU's Tightly Coupled Memory (TCM), reducing the bandwidth bottlenecks inherent in library-based approaches. This initiative complements our commercial toolchains by giving developers an open-source MLIR-based compilation stack and a more flexible path to advancing AI compilation capabilities. Hexagon-MLIR is a work in progress, and we are continuing to add many more optimizations and capabilities.
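The bandwidth argument behind mega-kernel fusion can be illustrated with a toy traffic model (a minimal sketch with hypothetical numbers, not taken from the paper): when each kernel in a chain materializes its output in off-chip memory, total traffic grows with the number of ops, whereas a fused mega-kernel keeps intermediates in fast on-chip storage such as the Hexagon NPU's TCM.

```python
# Toy model of off-chip memory traffic, in elements moved.
# Assumption (illustrative only): each unfused op reads its full input
# from off-chip memory and writes its full output back, while a fused
# mega-kernel reads the input once and writes the final result once,
# keeping all intermediates in on-chip memory (TCM).

def unfused_traffic(n: int, num_ops: int) -> int:
    # num_ops kernels, each doing one off-chip read and one off-chip write.
    return num_ops * 2 * n

def fused_traffic(n: int) -> int:
    # One read of the input plus one write of the final result.
    return 2 * n

n, num_ops = 1_000_000, 4
print(unfused_traffic(n, num_ops) // fused_traffic(n))  # 4
```

Under these assumptions, fusing a chain of four elementwise kernels cuts off-chip traffic by the length of the chain; the real savings on Hexagon depend on TCM capacity and the actual reuse pattern.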
Problem

Research questions and friction points this paper is trying to address.

AI compilation
Neural Processing Unit
Triton kernels
PyTorch models
bandwidth bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hexagon-MLIR
Neural Processing Unit (NPU)
Triton kernels
Tightly Coupled Memory (TCM)
MLIR
Mohammed Javed Absar
Qualcomm Technologies International, Ltd.
Muthu Baskaran
Qualcomm Technologies, Inc.
Abhikrant Sharma
Qualcomm Technologies, Inc.
Abhilash Bhandari
Qualcomm India Private Limited
Ankit Aggarwal
Qualcomm Technologies, Inc.
Arun Rangasamy
Qualcomm India Private Limited
Dibyendu Das
Intel
Fateme Hosseini
Qualcomm Technologies, Inc.
Franck Slama
Qualcomm Technologies, Inc.
Iulian Brumar
Qualcomm Technologies, Inc.
Jyotsna Verma
Qualcomm Technologies, Inc.
Krishnaprasad Bindumadhavan
Qualcomm India Private Limited
Mitesh Kothari
Qualcomm Technologies, Inc.
Mohit Gupta
Qualcomm Technologies, Inc.
Ravishankar Kolachana
Qualcomm India Private Limited
Richard Lethin
Qualcomm Technologies, Inc.
Samarth Narang
Qualcomm Technologies, Inc.
Sanjay Motilal Ladwa
Qualcomm India Private Limited
Shalini Jain
Qualcomm India Private Limited
Snigdha Suresh Dalvi
Qualcomm Technologies, Inc.
Tasmia Rahman
Qualcomm Technologies, Inc.
Venkat Rasagna Reddy Komatireddy
Qualcomm India Private Limited
Vivek Vasudevbhai Pandya
Qualcomm India Private Limited
Xiyue Shi
Qualcomm Technologies, Inc.
Zachary Zipper
Qualcomm Technologies, Inc.