AI Summary
This work addresses the memory bandwidth bottleneck and limited flexibility of conventional library-based approaches in deploying AI models on Qualcomm's Hexagon NPU by introducing the first open-source MLIR compilation stack tailored for this architecture. The proposed stack enables automatic compilation of optimized NPU binaries from both Triton kernels and PyTorch 2.0 subgraphs. Through a structured compilation pipeline, mega-kernel fusion, and explicit optimization of data locality in tightly coupled memory (TCM), it significantly enhances data reuse and alleviates off-chip bandwidth pressure. Notably, this study presents the first end-to-end compilation framework from Triton to the Hexagon NPU, establishing a high-performance, scalable, and open-source deployment pathway for edge AI applications.
Abstract
In this paper, we present Hexagon-MLIR, an open-source compilation stack that targets the Qualcomm Hexagon Neural Processing Unit (NPU) and provides unified support for lowering Triton kernels and PyTorch models. Built on the MLIR framework, our compiler applies a structured sequence of passes that exploit NPU architectural features to accelerate AI workloads. It enables faster deployment of new Triton kernels, whether hand-written or derived from PyTorch 2.0 subgraphs, by providing automated compilation from kernel to binary for our target. By ingesting Triton kernels, we generate mega-kernels that maximize data locality in the NPU's Tightly Coupled Memory (TCM), reducing the bandwidth bottlenecks inherent in library-based approaches. This initiative complements our commercial toolchains by giving developers an open-source MLIR-based compilation stack and a more flexible path to advance AI compilation capabilities. Hexagon-MLIR is a work in progress, and we are continuing to add further optimizations and capabilities.