The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Modern CPUs (x86_64, ARM, RISC-V) are increasingly integrating mixed-precision integer (MIP) dot-product units, yet efficient MIP general matrix multiplication (GEMM) kernels remain underexplored across architectures. Method: We propose the first cross-architecture, high-performance MIP-GEMM kernel design, featuring: (i) hardware-native microkernels; (ii) a customized data layout optimized for MIP computation characteristics; and (iii) a unified ISA abstraction supporting AVX-512, SVE, and RISC-V V extensions. Contribution/Results: This work presents the first systematic, high-efficiency port of MIP-GEMM to mainstream CPUs, breaking from conventional floating-point GEMM optimization paradigms. Evaluation shows an average 2.1× speedup over FP32 GEMM across x86_64, ARM, and RISC-V platforms, with 47% lower memory bandwidth consumption and 39% reduced energy usage—significantly enhancing quantized inference efficiency at the edge.

📝 Abstract
Recent advances in deep learning (DL) have led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats, such as FP16, BF16, and 8- or 16-bit integers, combined with mixed-precision arithmetic. This transition enhances computational throughput, reduces memory and bandwidth usage, and improves energy efficiency, offering significant advantages for resource-constrained edge devices. To support this shift, hardware architectures have evolved accordingly, now including adapted ISAs (Instruction Set Architectures) that expose mixed-precision vector units and matrix engines tailored for DL workloads. At the heart of many DL and scientific computing tasks is the general matrix-matrix multiplication (gemm), a fundamental kernel historically optimized using axpy vector instructions on SIMD (single instruction, multiple data) units. However, as hardware moves toward mixed-precision dot-product-centric operations optimized for quantized inference, these legacy approaches are being phased out. In response, our paper revisits traditional high-performance gemm and describes strategies for adapting it to mixed-precision integer (MIP) arithmetic across modern ISAs, including x86_64, ARM, and RISC-V. Concretely, we illustrate novel micro-kernel designs and data layouts that better exploit today's specialized hardware and demonstrate significant performance gains from MIP arithmetic over floating-point implementations across three representative CPU architectures. These contributions highlight a new era of gemm optimization, driven by the demands of DL inference on heterogeneous architectures, marking what we term the "Cambrian period" for matrix multiplication.
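The dot-product-centric flow the abstract contrasts with legacy axpy-based gemm can be sketched in a few lines. This is an illustrative Python sketch, not the paper's actual kernel: it mimics what hardware dot-product units (e.g. AVX512-VNNI, ARM SDOT/UDOT) do natively, multiplying int8 inputs and accumulating into a wide 32-bit register.

```python
# Hedged sketch of a mixed-precision integer (MIP) dot product:
# int8 x int8 products accumulated in a 32-bit accumulator, mirroring
# the hardware dot-product units described in the abstract. The function
# name and shapes are illustrative, not taken from the paper.
def mip_dot(a_i8, b_i8):
    # Each int8 x int8 product fits in 16 bits; summing them into a
    # 32-bit accumulator avoids the overflow a narrow accumulator would
    # suffer on long k dimensions -- the key reason these units widen.
    acc = 0  # plays the role of an int32 accumulator register
    for x, y in zip(a_i8, b_i8):
        acc += x * y
    return acc

# Example with quantized values in the int8 range [-128, 127]:
print(mip_dot([127, -128, 5], [2, 3, -1]))  # 254 - 384 - 5 = -135
```

In the hardware variants the summary names (AVX-512, SVE, RISC-V V), one instruction performs this multiply-and-widen-accumulate over a short k-block per lane, which is why the kernel design departs from axpy-style FP32 gemm.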
Problem

Research questions and friction points this paper is trying to address.

Adapting matrix multiplication to mixed-precision integer arithmetic
Optimizing gemm for quantized deep learning inference
Enhancing performance across x86_64, ARM, and RISC-V ISAs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixed-precision arithmetic for quantized DL
Novel micro-kernel designs for modern ISAs
Optimized data layouts for specialized hardware
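The micro-kernel idea listed above can be illustrated with a toy register-blocked loop. This is a minimal sketch under assumed shapes (a 2x2 block of C, packed k-major panels of A and B), not the paper's actual design: output accumulators stay "in registers" while packed panels stream through, with int8 inputs and int32 accumulation.

```python
# Illustrative 2x2 micro-kernel (hypothetical shapes and names):
# a_panel and b_panel are k-major packed slices, each row holding the
# two int8 values needed for one update of the 2x2 block of C.
def micro_kernel_2x2(a_panel, b_panel, k):
    # Four accumulators model the int32 registers holding the C block.
    c00 = c01 = c10 = c11 = 0
    for p in range(k):
        a0, a1 = a_panel[p]  # packed column of A at depth p
        b0, b1 = b_panel[p]  # packed row of B at depth p
        # Rank-1 update of the 2x2 block; on real hardware each pair of
        # updates maps to a dot-product instruction over a short k-block.
        c00 += a0 * b0; c01 += a0 * b1
        c10 += a1 * b0; c11 += a1 * b1
    return [[c00, c01], [c10, c11]]

print(micro_kernel_2x2([(1, 2), (3, 4)], [(5, 6), (7, 8)], 2))
```

Packing A and B into this k-major layout before the loop is the data-layout customization the summary refers to: it keeps the streamed operands contiguous for the dot-product unit instead of the column-major layout classic FP32 gemm assumes.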
Hector Martinez
Universidad de Cordoba, Spain
Adrian Castello
Universitat Politècnica de València, Spain
Francisco D. Igual
Universidad Complutense de Madrid
High Performance Computing · Dense Linear Algebra · GPU · GPGPU · DSP
Enrique S. Quintana-Orti
Universitat Politècnica de València, Spain