AraXL: A Physically Scalable, Ultra-Wide RISC-V Vector Processor Design for Fast and Efficient Computation on Long Vectors

📅 2025-01-17
🤖 AI Summary
To address energy-efficiency and performance bottlenecks in long-vector computation for HPC and ML, this work proposes AraXL, a modular, physically scalable 64-bit RISC-V vector processor architecture. Conventional vector processors struggle to scale beyond 8 double-precision lanes and 256 64-bit elements per vector register because their physical implementation becomes wire-dominated and inefficient. AraXL instead uses a distributed, hierarchical interconnect that supports up to 64 parallel vector lanes and reaches the maximum vector register size of 64 Kibit permitted by the RISC-V V 1.0 ISA specification. Implemented in a 22-nm technology node, the 64-lane configuration achieves a peak of 146 GFLOPs on computation-intensive HPC/ML kernels (>99% FPU utilization) and an energy efficiency of 40.1 GFLOPs/W at 1.15 GHz and 0.8 V (TT corner), with only 3.8× the area of a 16-lane instance.

📝 Abstract
The ever-growing scale of data parallelism in today's HPC and ML applications presents a big challenge for computing architectures' energy efficiency and performance. Vector processors address the scale-up challenge by decoupling Vector Register File (VRF) and datapath widths, allowing the VRF to host long vectors and increase register-stored data reuse while reducing the relative cost of instruction fetch and decode. However, even the largest vector processor designs today struggle to scale to more than 8 vector lanes with double-precision Floating Point Units (FPUs) and 256 64-bit elements per vector register. This limitation is induced by difficulties in the physical implementation, which becomes wire-dominated and inefficient. In this work, we present AraXL, a modular and scalable 64-bit RISC-V V vector architecture targeting long-vector applications for HPC and ML. AraXL addresses the physical scalability challenges of state-of-the-art vector processors with a distributed and hierarchical interconnect, supporting up to 64 parallel vector lanes and reaching the maximum Vector Register File size of 64 Kibit/vreg permitted by the RISC-V V 1.0 ISA specification. Implemented in a 22-nm technology node, our 64-lane AraXL achieves a performance peak of 146 GFLOPs on computation-intensive HPC/ML kernels (>99% FPU utilization) and energy efficiency of 40.1 GFLOPs/W (1.15 GHz, TT, 0.8V), with only 3.8x the area of a 16-lane instance.
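
The headline figures in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and rests on two assumptions the abstract does not state: that each lane retires one double-precision fused multiply-add (2 FLOPs) per cycle, and that the peak performance and energy-efficiency figures refer to the same operating point. All names are hypothetical.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Values taken from the abstract:
LANES = 64                      # parallel vector lanes
CLOCK_GHZ = 1.15                # clock frequency (TT corner, 0.8 V)
REPORTED_GFLOPS = 146.0         # reported peak performance
REPORTED_GFLOPS_PER_W = 40.1    # reported energy efficiency
VREG_BITS = 64 * 1024           # 64 Kibit per vector register
ELEMENT_BITS = 64               # double-precision element width

# ASSUMPTION (not stated in the abstract): one FP64 FMA per lane per cycle,
# counted as 2 floating-point operations.
FLOPS_PER_LANE_PER_CYCLE = 2


def theoretical_peak_gflops(lanes: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Upper bound on throughput if every lane issues an FMA every cycle."""
    return lanes * clock_ghz * flops_per_cycle


peak = theoretical_peak_gflops(LANES, CLOCK_GHZ, FLOPS_PER_LANE_PER_CYCLE)
utilization = REPORTED_GFLOPS / peak

# ASSUMPTION: performance and efficiency were measured at the same operating point.
implied_power_w = REPORTED_GFLOPS / REPORTED_GFLOPS_PER_W

elements_per_vreg = VREG_BITS // ELEMENT_BITS   # 64-bit elements per register
vreg_bytes = VREG_BITS // 8                     # register size in bytes

# Area grows 3.8x for a 4x increase in lane count (16 -> 64 lanes).
relative_area_per_lane = 3.8 / (LANES / 16)

print(f"theoretical peak : {peak:.1f} GFLOPs")
print(f"FPU utilization  : {utilization:.1%}")
print(f"implied power    : {implied_power_w:.2f} W")
print(f"elements per vreg: {elements_per_vreg} ({vreg_bytes} bytes)")
print(f"area per lane vs. 16-lane instance: {relative_area_per_lane:.2f}x")
```

Under these assumptions the theoretical ceiling is about 147.2 GFLOPs, so the reported 146 GFLOPs matches the stated >99% FPU utilization. A 64 Kibit register holds 1024 double-precision elements (8 KiB), four times the 256-element limit cited for prior designs.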

Problem

Research questions and friction points this paper is trying to address.

High-Performance Computing
Machine Learning
Vector Processor Limitations

Innovation

Methods, ideas, or system contributions that make the work stand out.

RISC-V V Vector Architecture
Physical Scalability
High Performance Computing/Machine Learning (HPC/ML)