Leveraging Hardware-Aware Computation in Mixed-Precision Matrix Multiply: A Tile-Centric Approach

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address performance and energy-efficiency bottlenecks of General Matrix Multiplication (GEMM) on heterogeneous hardware, this work proposes a fine-grained adaptive mixed-precision framework. Unlike conventional layer- or tensor-level coarse-grained approaches, it dynamically selects the most suitable numerical precision (e.g., FP16/FP32/FP64) at the tile/block level, tightly coupling precision selection with hardware-aware block scheduling. The framework integrates the PaRSEC runtime to enable cross-architecture task load balancing and low-overhead precision transitions across ARM CPUs, NVIDIA GPUs, and AMD GPUs, bridging the gap between algorithmic numerical-robustness requirements and hardware-specific compute and energy-efficiency characteristics. Evaluations on supercomputing platforms (Fugaku, Frontier, and an NVIDIA A100 DGX) demonstrate up to 2.1× speedup and 1.8× energy-efficiency improvement over single-precision baselines, while preserving the numerical stability critical for scientific applications.
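The tile-level precision selection described above can be sketched roughly as follows. This is a minimal serial illustration, not the paper's implementation: the range-based heuristic, its thresholds, the tile size, and the names `pick_dtype` and `tiled_mixed_gemm` are all assumptions made for the example, and the actual framework dispatches tiles as tasks through the PaRSEC runtime rather than a Python loop.

```python
import numpy as np

def pick_dtype(tile_a, tile_b, lo=1e-3, hi=1e3):
    """Choose a compute precision from the tiles' dynamic range.

    Hypothetical heuristic: well-scaled tiles run in FP16, moderately
    scaled tiles in FP32, and extreme ranges fall back to FP64.
    """
    scale = max(np.abs(tile_a).max(), np.abs(tile_b).max(), 1e-30)
    if lo <= scale <= hi:
        return np.float16
    if scale <= 1e6:
        return np.float32
    return np.float64

def tiled_mixed_gemm(A, B, tile=32):
    """Compute C = A @ B with a per-tile precision choice.

    Each tile product runs in the precision chosen by pick_dtype,
    while partial results accumulate in FP64 for stability.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=np.float64)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                ta = A[i:i+tile, p:p+tile]
                tb = B[p:p+tile, j:j+tile]
                dt = pick_dtype(ta, tb)
                # Cast the tiles down, multiply, accumulate in FP64.
                C[i:i+tile, j:j+tile] += (
                    ta.astype(dt) @ tb.astype(dt)
                ).astype(np.float64)
    return C
```

In a runtime-driven version, each inner tile product would become an independent task whose precision tag also steers it toward the hardware unit (e.g., tensor cores for FP16) best suited to execute it.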

📝 Abstract
General Matrix Multiplication (GEMM) is a critical operation underpinning a wide range of applications in high-performance computing (HPC) and artificial intelligence (AI). The emergence of hardware optimized for low-precision arithmetic necessitates a reevaluation of numerical algorithms to leverage mixed-precision computation, achieving improved performance and energy efficiency. This research introduces an adaptive mixed-precision GEMM framework that supports different precision formats at fine-grained tile/block levels. We utilize the PaRSEC runtime system to balance workloads across various architectures. Performance scales well on the ARM CPU-based Fugaku supercomputer, the NVIDIA GPU-based A100 DGX, and the AMD GPU-based Frontier supercomputer. This research aims to enhance computational efficiency and accuracy by bridging algorithmic advancements and hardware innovations, driving transformative progress across applications.
Problem

Research questions and friction points this paper is trying to address.

Optimizing mixed-precision matrix multiplication for hardware efficiency
Developing tile-level adaptive precision for GEMM operations
Bridging algorithmic advancements with emerging hardware capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixed-precision GEMM with tile-level adaptation
PaRSEC runtime for cross-architecture workload balancing
Optimized performance across ARM/NVIDIA/AMD supercomputers
Authors

Qiao Zhang, Saint Louis University, USA
Rabab Alomairy, Massachusetts Institute of Technology, USA; King Abdullah University of Science and Technology, KSA
Dali Wang, ORNL (Exascale High-Res Land Model; Artificial Intelligence; High Performance Computing)
Zhuowei Gu, Saint Louis University, USA
Qinglei Cao, Saint Louis University (HPC; AI)