AraOS: Analyzing the Impact of Virtual Memory Management on Vector Unit Performance

📅 2025-04-14

🤖 AI Summary

This work addresses the lack of published performance data for open-source RISC-V Vector Extension (RVV) processors running under a full-fledged operating system (Linux). It adds OS support to the bare-metal Ara2 vector core, yielding AraOS, by sharing the MMU of CVA6, the scalar core that dispatches instructions to Ara2, and integrates the design into the Cheshire SoC platform. The evaluation benchmarks matrix multiplication kernels across several problem sizes and TLB configurations, then runs the RiVEC benchmark suite on a 2-lane AraOS instance. Key contributions are: (1) Linux-capable OS support for the open-source Ara2 RVV processor; (2) a quantification of how TLB capacity in the shared MMU affects vector performance; and (3) a measured virtual memory overhead below 3.5% with at least 16 TLB entries, together with peak average speedups of 3.2x over scalar-only execution on RiVEC.

📝 Abstract

Vector processor architectures offer an efficient solution for accelerating data-parallel workloads (e.g., ML, AI), reducing instruction count, and enhancing processing efficiency. This is evidenced by the increasing adoption of vector ISAs, such as Arm's SVE/SVE2 and RISC-V's RVV, not only in high-performance computers but also in embedded systems. The open-source nature of RVV has particularly encouraged the development of numerous vector processor designs across industry and academia. However, despite the growing number of open-source RVV processors, there is a lack of published data on their performance in a complex application environment hosted by a full-fledged operating system (Linux). In this work, we add OS support to the open-source bare-metal Ara2 vector processor (AraOS) by sharing the MMU of CVA6, the scalar core used for instruction dispatch to Ara2, and integrate AraOS into the open-source Cheshire SoC platform. We evaluate the performance overhead of virtual-to-physical address translation by benchmarking matrix multiplication kernels across several problem sizes and translation lookaside buffer (TLB) configurations in CVA6's shared MMU, providing insights into vector performance in a full-system environment with virtual memory. With at least 16 TLB entries, the virtual memory overhead remains below 3.5%. Finally, we benchmark a 2-lane AraOS instance with the open-source RiVEC benchmark suite for RVV architectures, with peak average speedups of 3.2x against scalar-only execution.
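To build intuition for why TLB capacity matters in this kind of experiment, the following is a minimal sketch of a toy TLB model driven by the address stream of a row-wise vectorized matrix multiplication. It is not the authors' methodology (the paper measures real hardware); the fully associative LRU policy, the 4 KiB page size, the 16-element vector length, the contiguous matrix layout, and the names `LruTlb` and `matmul_tlb_misses` are all assumptions made for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumption: 4 KiB base pages


class LruTlb:
    """Toy fully associative TLB with LRU replacement (hypothetical model)."""

    def __init__(self, entries):
        self.entries = entries
        self.pages = OrderedDict()  # virtual page number -> placeholder
        self.hits = 0
        self.misses = 0

    def access(self, vaddr):
        vpn = vaddr // PAGE_SIZE
        if vpn in self.pages:
            self.pages.move_to_end(vpn)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.pages) >= self.entries:
                self.pages.popitem(last=False)  # evict least recently used
            self.pages[vpn] = True


def matmul_tlb_misses(n, tlb_entries, elem_bytes=8, vlen_elems=16):
    """Count TLB misses for C = A * B on n x n matrices.

    Assumed access pattern: for each row i and each chunk of columns j,
    load scalar A[i][k] and a unit-stride vector of B[k][j:j+vl] for every k,
    then store the C[i][j:j+vl] chunk. Each vector access is modeled by
    translating its first and last byte.
    """
    tlb = LruTlb(tlb_entries)
    size = n * n * elem_bytes
    a_base, b_base, c_base = 0, size, 2 * size  # assumption: contiguous layout
    for i in range(n):
        for j in range(0, n, vlen_elems):
            vl = min(vlen_elems, n - j)
            for k in range(n):
                tlb.access(a_base + (i * n + k) * elem_bytes)
                start = b_base + (k * n + j) * elem_bytes
                tlb.access(start)
                tlb.access(start + vl * elem_bytes - 1)
            start = c_base + (i * n + j) * elem_bytes
            tlb.access(start)
            tlb.access(start + vl * elem_bytes - 1)
    return tlb.misses


if __name__ == "__main__":
    for entries in (2, 4, 8, 16, 32):
        print(entries, matmul_tlb_misses(64, entries))
```

In this model, a TLB large enough to hold every page touched by the three matrices sees only compulsory misses, while smaller TLBs repeatedly evict the pages of B as the kernel walks down its rows. Miss counts from such a model say nothing about cycle-level overhead, which depends on page-table-walk latency and how much of it the vector unit can hide.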

Problem

Research questions and friction points this paper is trying to address.

Analyzing virtual memory impact on vector processor performance

Evaluating virtual-to-physical address translation overhead in vector units

Benchmarking RVV processor performance in full-system OS environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates OS support for Ara2 vector processor

Shares MMU of CVA6 for address translation

Benchmarks vector performance under virtual memory across TLB configurations
