🤖 AI Summary
Optimizing mixed-precision deep learning operators—e.g., mixed-type matrix multiplication—on GPUs remains challenging: high-level compilers (e.g., Triton) lack expressive power for fine-grained hardware control, while low-level libraries (e.g., CUTLASS) incur prohibitive development overhead. To bridge this gap, we propose Hexcute, a novel tiled programming language. Its core innovation is the first type-inference-driven synthesis algorithm that jointly optimizes data layout and task mapping; it explicitly exposes shared memory and register abstractions, enables fine-grained data pipelining, and enforces hardware-friendly memory layouts. Hexcute achieves high expressivity while drastically reducing GPU programming complexity. Experiments show that Hexcute accelerates mixed-precision operators by 1.7×–11.28× over state-of-the-art deep learning compilers, delivers up to 2.91× end-to-end speedup, and generalizes effectively across diverse deep learning operators.
📝 Abstract
Deep learning (DL) workloads mainly run on accelerators like GPUs. Recent DL quantization techniques demand a new matrix multiplication operator with mixed input data types, further complicating GPU optimization. Prior high-level compilers like Triton lack the expressiveness to implement key optimizations like fine-grained data pipelines and hardware-friendly memory layouts for these operators, while low-level programming models, such as Hidet, Graphene, and CUTLASS, require significant programming effort. To balance expressiveness with engineering effort, we propose Hexcute, a tile-based programming language that exposes shared memory and register abstractions to enable fine-grained optimization for these operators. Additionally, Hexcute leverages task mapping to schedule the GPU program, and to reduce programming effort, it automates layout and task mapping synthesis with a novel type-inference-based algorithm. Our evaluation shows that Hexcute generalizes to a wide range of DL operators, achieves 1.7–11.28× speedup over existing DL compilers for mixed-type operators, and brings up to 2.91× speedup in the end-to-end evaluation.
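To make the mixed-type setting concrete, the sketch below shows the math such an operator computes: fp16 activations multiplied by int4-quantized weights that are dequantized via per-column scales, with fp32 accumulation. This is a minimal NumPy reference for illustration only; the function name and quantization scheme are assumptions, and none of this reflects Hexcute's actual API or generated kernels, which fuse the dequantization into a tiled GPU matmul.

```python
import numpy as np

def mixed_type_matmul(a_fp16, w_int4, scales):
    """Reference semantics (hypothetical helper, not Hexcute's API) for a
    weight-only-quantized matmul: fp16 activations x int4 weights."""
    # Dequantize: int4 values in [-8, 7], scaled per output column.
    w_fp16 = (w_int4.astype(np.float16) * scales).astype(np.float16)
    # Accumulate in fp32 for accuracy, as tensor-core kernels typically do.
    return a_fp16.astype(np.float32) @ w_fp16.astype(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float16)        # fp16 activations
w_q = rng.integers(-8, 8, size=(8, 16), dtype=np.int8)    # int4 range in int8 container
s = (rng.random(16) * 0.1).astype(np.float16)             # per-column fp16 scales
out = mixed_type_matmul(a, w_q, s)                        # shape (4, 16), fp32
```

A fast GPU kernel fuses the dequantization line into the matmul's inner loop, which is exactly where the fine-grained pipelining and layout control discussed above matter.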