Fast Transformer Inference on ARM-Based HMPSoCs

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the computational and memory bottlenecks that hinder Transformer inference on resource-constrained ARM edge devices, where the existing ARM Compute Library (ARM-CL) lacks native Transformer support. The study implements new Transformer kernels within ARM-CL and proposes a low-overhead cooperative CPU-GPU inference mechanism. Work is partitioned across the heterogeneous processors: memory-intensive operations run on the CPU, while highly parallelizable, compute-intensive operations run on the GPU. On an ARM-based embedded board, the extended ARM-CL achieves up to 3× faster Transformer inference than state-of-the-art CPU/GPU implementations, and cooperative execution reduces inference latency by up to a further 15.72% relative to the best single-processor inference on ARM-CL.

📝 Abstract

Transformer models have set new performance standards for machine learning (ML) tasks. However, their resource-intensive deployment on resource-constrained edge devices for cloud-free, on-chip transformer inference remains challenging. The ARM Compute Library (ARM-CL) framework provides low-latency CNN inference on ARM-based edge devices but lacks support for transformer inference. In this work, we implement several new transformer kernels in ARM-CL to support native transformer execution. Our extended ARM-CL achieves up to three times faster transformer inference compared to state-of-the-art CPU/GPU implementations on an ARM-based embedded board. Furthermore, heterogeneous multi-processor system-on-chips (HMPSoCs) powering edge devices provide both embedded CPUs and GPUs. We introduce cooperative CPU-GPU transformer inference, which executes memory-intensive operations on the CPU while utilizing the GPU for highly parallelizable, compute-intensive operations. This cooperative execution, implemented with minimal overhead, further reduces transformer inference latency by up to 15.72% compared to the best single-processor inference on ARM-CL.
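The CPU/GPU split described in the abstract can be illustrated with a minimal sketch. This is not the paper's scheduler or any ARM-CL API: the `Op` class, the `matmul` and `elementwise` helpers, the per-element FLOP counts, and the intensity threshold of 10 FLOPs per byte are all assumptions chosen for illustration. The sketch estimates each operation's arithmetic intensity (FLOPs per byte moved) and places low-intensity, memory-bound operations on the CPU and high-intensity, compute-bound operations on the GPU.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """Hypothetical description of one transformer operation."""
    name: str
    flops: float        # estimated floating-point operations
    bytes_moved: float  # estimated bytes read plus bytes written

    @property
    def intensity(self) -> float:
        # Arithmetic intensity: FLOPs per byte of memory traffic.
        return self.flops / self.bytes_moved


def matmul(name: str, m: int, k: int, n: int, elem_bytes: int = 4) -> Op:
    # (m x k) @ (k x n): 2*m*k*n FLOPs; reads both inputs, writes the output.
    flops = 2 * m * k * n
    traffic = elem_bytes * (m * k + k * n + m * n)
    return Op(name, flops, traffic)


def elementwise(name: str, numel: int, flops_per_elem: int,
                elem_bytes: int = 4) -> Op:
    # Reads and writes each element once; flops_per_elem is a rough guess.
    return Op(name, flops_per_elem * numel, elem_bytes * 2 * numel)


def assign(ops: list[Op], threshold: float) -> dict[str, str]:
    # Assumed rule: compute-bound ops go to the GPU, memory-bound ops to the CPU.
    return {op.name: ("GPU" if op.intensity >= threshold else "CPU")
            for op in ops}


# Illustrative shapes only: sequence length 128, hidden size 768.
seq, d = 128, 768
ops = [
    matmul("qkv_projection", seq, d, 3 * d),
    elementwise("softmax", seq * seq, 5),
    elementwise("layer_norm", seq * d, 8),
    matmul("ffn_up", seq, d, 4 * d),
]
placement = assign(ops, threshold=10.0)

for op in ops:
    print(f"{op.name:15s} intensity={op.intensity:6.2f} -> {placement[op.name]}")
```

Under these assumed numbers, the two matrix multiplications land on the GPU and the softmax and layer normalization land on the CPU. A real partitioning would also need to account for synchronization and data-transfer costs between processors, which the abstract reports are kept minimal but does not detail.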

Problem

Research questions and friction points this paper is trying to address.

Transformer inference

ARM-based HMPSoCs

edge devices

resource-constrained

on-chip inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer inference

ARM Compute Library

HMPSoC