🤖 AI Summary
Modern CPUs (x86_64, ARM, RISC-V) are increasingly integrating mixed-precision integer (MIP) dot-product units, yet efficient MIP general matrix multiplication (GEMM) kernels remain underexplored across architectures.
Method: We propose the first cross-architecture, high-performance MIP-GEMM kernel design, featuring: (i) hardware-native microkernels; (ii) a customized data layout optimized for MIP computation characteristics; and (iii) a unified ISA abstraction supporting AVX-512, SVE, and the RISC-V Vector (RVV) extension.
Contribution/Results: This work presents the first systematic, high-efficiency port of MIP-GEMM to mainstream CPUs, breaking from conventional floating-point GEMM optimization paradigms. Evaluation shows an average 2.1× speedup over FP32 GEMM across x86_64, ARM, and RISC-V platforms, with 47% lower memory bandwidth consumption and 39% reduced energy usage—significantly enhancing quantized inference efficiency at the edge.
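The "customized data layout" in point (ii) typically means interleaving the K dimension of an operand so that each vector load feeds a dot-product instruction directly. The NumPy sketch below illustrates one such K-group packing; the routine name and the group size of 4 are assumptions (4 matches the int8 lane grouping of AVX-512 VNNI's `VPDPBUSD` and Arm's `SDOT`), not necessarily the paper's exact layout.

```python
import numpy as np

def pack_b_k4(B):
    """Hypothetical packing routine for the B operand of an int8 GEMM.

    Rearranges an int8 (K, N) matrix so that each group of 4 consecutive
    K elements of a column is stored contiguously -- the layout that
    4-way integer dot-product instructions expect to consume with a
    single contiguous vector load.
    """
    K, N = B.shape
    assert K % 4 == 0, "K must be a multiple of the dot-product group size"
    # (K, N) -> (K//4, 4, N) -> (K//4, N, 4): group K in fours, then put
    # the 4-element group innermost so it is contiguous in memory.
    return B.reshape(K // 4, 4, N).transpose(0, 2, 1).copy()
```

After packing, element `B[4*kb + g, n]` lives at `packed[kb, n, g]`, so a microkernel can stream one 4-byte group per column per dot-product instruction.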
📝 Abstract
Recent advances in deep learning (DL) have led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats, such as FP16, BF16, and 8- or 16-bit integers, combined with mixed-precision arithmetic. This transition enhances computational throughput, reduces memory and bandwidth usage, and improves energy efficiency, offering significant advantages for resource-constrained edge devices. To support this shift, hardware architectures have evolved accordingly, now including adapted ISAs (Instruction Set Architectures) that expose mixed-precision vector units and matrix engines tailored for DL workloads. At the heart of many DL and scientific computing tasks is the general matrix-matrix multiplication (gemm), a fundamental kernel historically optimized using axpy vector instructions on SIMD (single instruction, multiple data) units. However, as hardware moves toward mixed-precision, dot-product-centric operations optimized for quantized inference, these legacy approaches are being phased out. In response, our paper revisits traditional high-performance gemm and describes strategies for adapting it to mixed-precision integer (MIP) arithmetic across modern ISAs, including x86_64, ARM, and RISC-V. Concretely, we illustrate novel micro-kernel designs and data layouts that better exploit today's specialized hardware and demonstrate significant performance gains from MIP arithmetic over floating-point implementations across three representative CPU architectures. These contributions highlight a new era of gemm optimization, driven by the demands of DL inference on heterogeneous architectures, marking what we term the "Cambrian period" for matrix multiplication.
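The abstract's contrast between legacy axpy-style kernels and dot-product-centric MIP kernels can be sketched in scalar form. The sketch below is illustrative, not the paper's implementation: it emulates in NumPy the semantics of a 4-way int8 dot-product instruction (as in AVX-512 VNNI's `VPDPBUSD` or Arm's `SDOT`), accumulating into int32 as quantized-inference kernels do. Function names are hypothetical.

```python
import numpy as np

def dot4_accumulate(acc, a_bytes, b_bytes):
    """Emulate one lane-group of a 4-way integer dot-product instruction:
    sum four int8 x int8 products into an int32 accumulator, the core
    primitive that replaces axpy-style FP updates in MIP microkernels."""
    prod = a_bytes.astype(np.int32) * b_bytes.astype(np.int32)
    return acc + prod.reshape(-1, 4).sum(axis=1)

def mip_gemm(A, B):
    """Scalar reference MIP-GEMM: A (M, K) int8, B (K, N) int8,
    K a multiple of 4; returns C (M, N) in int32. The K loop walks in
    groups of 4, mirroring how hardware dot-product units consume data."""
    M, K = A.shape
    _, N = B.shape
    assert K % 4 == 0
    C = np.zeros((M, N), dtype=np.int32)
    for i in range(M):
        for j in range(N):
            acc = np.zeros(1, dtype=np.int32)
            for k in range(0, K, 4):
                acc = dot4_accumulate(acc, A[i, k:k + 4], B[k:k + 4, j])
            C[i, j] = acc[0]
    return C
```

The key difference from a classical axpy-based gemm is that the innermost reduction happens inside one instruction (four multiplies plus a horizontal add into a wider accumulator), so the microkernel keeps int32 accumulators in registers and never round-trips through floating point.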