🤖 AI Summary
This work addresses the lack of efficient and reusable compilation infrastructure between high-level AI frameworks and hardware accelerators, which hinders automatic generation of high-performance code. Building upon MLIR, the authors propose a modular compiler featuring a lightweight affine analysis pipeline that integrates loop transformations, multi-level tiling, operator and attention-layer fusion, on-chip memory management, and mapping to specialized compute units. Combined with an analytical cost model and heuristic strategies, the system enables fully automated optimization from PyTorch/JAX down to hardware primitives. Evaluated on NVIDIA GPUs, the JIT-compiled code matches or exceeds the performance of Torch Inductor and XLA, with generated matrix multiplication and convolution kernels achieving parity with vendor-optimized libraries or hand-tuned kernels.
📝 Abstract
We present the design and implementation of PolyBlocks, a modular and reusable MLIR-based compiler infrastructure for AI programming frameworks and AI chips. PolyBlocks is based on pass pipelines that compose transformations on loop nests and SSA form, relying primarily on lightweight affine access analysis; the transformations are stitched together in specialized ways to realize high-performance code automatically through analytical cost models and heuristics. The optimizations in these passes include multi-level tiling, fusion, on-chip scratchpad usage, mapping matmuls and convolutions to matrix units, fusing the attention layer, and several other transformations for parallelism and locality. They have been developed in a way that makes it easy to build PolyBlocks-based compilers targeting new chips while reusing much of the infrastructure. PolyBlocks' design and architecture enable fully automatic code generation from high-level frameworks down to low-level target-specific intrinsics. Experimental results from evaluating PolyBlocks-powered just-in-time compilation for PyTorch and JAX targeting NVIDIA GPUs show that it matches or outperforms Torch Inductor and XLA in several cases, even though the latter rely on a combination of vendor libraries and code generation. For individual operators like matmuls and convolutions, PolyBlocks-generated code is competitive with the best vendor-tuned libraries or hand-written kernels.