🤖 AI Summary
This work addresses the lack of an efficient, lightweight machine learning runtime for AI inference on the VideoCore VII QPU of the Raspberry Pi 5. It presents a QPU-first, end-to-end ML runtime stack tailored to this architecture, built on the py-videocore7 assembly library. The implementation includes a reusable tiled matrix-multiplication substrate, GEMM-backed convolution, and a single-head attention-style core, together with persistent executors. Integer execution relies on smul24 instructions, with dense integer kernels taking packed INT16 inputs and accumulating in INT32. The dense integer kernels reach nearly two orders of magnitude higher throughput than NumPy, and the authors report improved performance over both PyTorch and NumPy across min/max, pooling, convolution, and attention. These preliminary results suggest that the Raspberry Pi 5’s QPU can serve as a practical execution substrate for AI models at the edge.
📝 Abstract
We present a QPU-first ML runtime stack for the Raspberry Pi 5's VideoCore VII QPU, built on top of the py-videocore7 assembly library. The system comprises a reusable tiled matrix-multiplication substrate, GEMM-backed convolution, a single-head attention-style core, persistent executors, and integer execution based on smul24 instructions. For dense integer kernels, packed INT16 inputs with INT32 accumulation achieve nearly two orders of magnitude higher throughput than NumPy. Across operations (min/max, pooling, convolution, attention), we report improved performance over both PyTorch and NumPy. Our preliminary results indicate that Raspberry Pi QPUs can serve as a practical execution substrate for accelerating AI model execution at the edge.
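To make the numeric contract of the dense integer path concrete, the following is a minimal host-side reference sketch in plain Python. It is not QPU code and not the authors' kernels. It assumes that each product of two INT16 values (which always fits in 32 bits, as a signed 24-bit multiply such as smul24 would produce) is added into an INT32 accumulator that wraps on overflow. The function names `wrap_int32`, `tiled_matmul_i16_i32`, and `conv2d_via_gemm`, the tile size, and the stride-1 "valid" convolution layout are all hypothetical choices made for illustration.

```python
# Reference model only: plain Python, not QPU assembly and not the paper's kernels.
# All names and the tile size below are illustrative assumptions.

INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1


def wrap_int32(x):
    """Wrap an arbitrary Python int to two's-complement INT32."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def tiled_matmul_i16_i32(a, b, tile=4):
    """C = A @ B with INT16 inputs and INT32 accumulators, computed tile by tile."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    assert all(INT16_MIN <= v <= INT16_MAX for row in a + b for v in row)
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):            # tile over output rows
        for j0 in range(0, n, tile):        # tile over output columns
            for k0 in range(0, k, tile):    # accumulate partial sums over K tiles
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            # INT16 x INT16 product fits in 32 bits; the sum may wrap.
                            acc = wrap_int32(acc + a[i][kk] * b[kk][j])
                        c[i][j] = acc
    return c


def conv2d_via_gemm(image, kernels):
    """Stride-1 'valid' 2D cross-correlation lowered to a single GEMM (im2col).

    image:   H x W nested list of INT16 values
    kernels: list of kh x kw filters (all the same size)
    returns: one (H-kh+1) x (W-kw+1) output map per filter
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    oh, ow = h - kh + 1, w - kw + 1
    # im2col: every output position becomes one row holding its flattened patch.
    cols = [[image[y + dy][x + dx] for dy in range(kh) for dx in range(kw)]
            for y in range(oh) for x in range(ow)]
    # Weight matrix: one column per filter, rows in the same patch order.
    wmat = [[ker[dy][dx] for ker in kernels]
            for dy in range(kh) for dx in range(kw)]
    out = tiled_matmul_i16_i32(cols, wmat)  # shape: (oh * ow) x num_filters
    return [[[out[y * ow + x][f] for x in range(ow)] for y in range(oh)]
            for f in range(len(kernels))]
```

A model like this is mainly useful as a correctness oracle: an accelerated kernel can be checked against it bit for bit, including the wraparound case, before any throughput comparison is made.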