🤖 AI Summary
This work addresses the lack of published performance data for open-source RISC-V Vector extension (RVV) processors running under a full-fledged operating system (Linux), where vector memory accesses go through virtual memory. We add OS support to the bare-metal Ara2 vector processor, yielding AraOS. Methodologically, Ara2 shares the MMU of CVA6, the scalar core that dispatches instructions to it, and the combined design is integrated into the Cheshire SoC platform. We measure the address-translation overhead with matrix multiplication kernels across several problem sizes and TLB configurations, and additionally benchmark a 2-lane AraOS instance with the RiVEC suite. Key contributions are: (1) Linux-capable OS support for an open-source RVV processor through a shared-MMU design; (2) a quantification of how TLB size affects vector performance; and (3) a virtual memory overhead below 3.5% with at least 16 TLB entries, together with peak average speedups of 3.2× for 2-lane AraOS over scalar-only execution.
📝 Abstract
Vector processor architectures offer an efficient solution for accelerating data-parallel workloads (e.g., ML, AI), reducing instruction count, and enhancing processing efficiency. This is evidenced by the increasing adoption of vector ISAs, such as Arm's SVE/SVE2 and RISC-V's RVV, not only in high-performance computers but also in embedded systems. The open-source nature of RVV has particularly encouraged the development of numerous vector processor designs across industry and academia. However, despite the growing number of open-source RVV processors, there is a lack of published data on their performance in a complex application environment hosted by a full-fledged operating system (Linux). In this work, we add OS support to the open-source bare-metal Ara2 vector processor (AraOS) by sharing the MMU of CVA6, the scalar core used for instruction dispatch to Ara2, and integrate AraOS into the open-source Cheshire SoC platform. We evaluate the performance overhead of virtual-to-physical address translation by benchmarking matrix multiplication kernels across several problem sizes and translation lookaside buffer (TLB) configurations in CVA6's shared MMU, providing insights into vector performance in a full-system environment with virtual memory. With at least 16 TLB entries, the virtual memory overhead remains below 3.5%. Finally, we benchmark a 2-lane AraOS instance with the open-source RiVEC benchmark suite for RVV architectures, with peak average speedups of 3.2x against scalar-only execution.
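To make the relationship between TLB size and translation overhead more concrete, the following is a minimal sketch of a toy TLB model driven by the memory accesses of a simple matrix multiplication. It is not the methodology or the hardware described in the abstract: the fully associative organization, the LRU replacement policy, the 4 KiB page size, the base addresses, and the names `LruTlb`, `matmul_accesses`, and `miss_count` are all assumptions made for illustration. It only counts TLB misses and says nothing about cycle-level overhead.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4 KiB base pages


class LruTlb:
    """Toy fully associative TLB with LRU replacement (illustrative assumption)."""

    def __init__(self, entries):
        self.entries = entries
        self.pages = OrderedDict()  # virtual page number -> placeholder
        self.hits = 0
        self.misses = 0

    def access(self, vaddr):
        vpn = vaddr // PAGE_SIZE
        if vpn in self.pages:
            self.pages.move_to_end(vpn)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.pages[vpn] = True
            if len(self.pages) > self.entries:
                self.pages.popitem(last=False)  # evict least recently used


def matmul_accesses(n, elem_bytes=8,
                    base_a=0x10000000, base_b=0x20000000, base_c=0x30000000):
    """Yield element addresses touched by a simple row-wise n x n matmul, C = A * B.

    For each row i of C: read scalar A[i][k] and row B[k][:] for every k,
    then write row C[i][:]. The base addresses are hypothetical.
    """
    row = n * elem_bytes
    for i in range(n):
        for k in range(n):
            yield base_a + i * row + k * elem_bytes
            for j in range(n):
                yield base_b + k * row + j * elem_bytes
        for j in range(n):
            yield base_c + i * row + j * elem_bytes


def miss_count(n, entries):
    """Return the number of TLB misses for an n x n matmul with the given TLB size."""
    tlb = LruTlb(entries)
    for addr in matmul_accesses(n):
        tlb.access(addr)
    return tlb.misses


if __name__ == "__main__":
    for entries in (2, 4, 8, 16, 32):
        print(entries, miss_count(64, entries))
```

Once the TLB can hold every page the kernel touches, the model reports only compulsory misses (one per distinct page); smaller configurations incur additional misses as rows of the operand matrices evict each other. Real overhead also depends on page-table walk latency and on how much of it the vector unit can hide, which this sketch does not model.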