🤖 AI Summary
A “two-language problem” persists between algorithm development in high-level languages and hardware implementation in low-level HDLs. Method: This paper introduces the first MLIR-based high-level synthesis (HLS) toolchain natively supporting Julia—compiling Julia kernels directly to vendor-agnostic, synthesizable SystemVerilog RTL without language extensions or manual annotations, with built-in support for the AXI4-Stream protocol. The toolchain supports both static and dynamic scheduling, balancing predictability and expressiveness. Contribution/Results: The generated RTL runs at 100 MHz on real FPGA devices. On signal processing and mathematical benchmarks, throughput reaches 59.71%–82.6% of that achieved by leading C/C++ HLS tools. This improves end-to-end development efficiency and hardware portability—from algorithm specification to synthesized RTL—while preserving Julia’s composability and productivity.
📝 Abstract
With the push towards Exascale computing and data-driven methods, problem sizes have grown dramatically, increasing the computational requirements of the underlying algorithms. This has led to a push to offload computations to general-purpose hardware accelerators such as GPUs and TPUs, and a renewed interest in designing problem-specific accelerators using FPGAs. However, the development process for these problem-specific accelerators currently suffers from the "two-language problem": algorithms are developed in one (usually higher-level) language, but the kernels are implemented in another language at a completely different level of abstraction, requiring fundamentally different expertise. To address this problem, we propose a new MLIR-based compiler toolchain that unifies the development process by automatically compiling kernels written in the Julia programming language into SystemVerilog without the need for any additional directives or language customisations. Our toolchain supports both dynamic and static scheduling, directly integrates with the AXI4-Stream protocol to interface with subsystems like on- and off-chip memory, and generates vendor-agnostic RTL. This prototype toolchain is able to synthesize a set of signal processing/mathematical benchmarks that operate at 100 MHz on real FPGA devices, achieving between 59.71% and 82.6% of the throughput of designs generated by state-of-the-art toolchains that compile only from lower-level languages like C or C++. Overall, this toolchain allows domain experts to write compute kernels in Julia as they normally would, and then retarget them to an FPGA without additional pragmas or modifications.
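To illustrate the claimed workflow, the sketch below shows the kind of plain Julia kernel the abstract describes: ordinary Julia code with no pragmas, directives, or language extensions. The FIR-filter kernel and its name are illustrative assumptions, not taken from the paper, and this only demonstrates the input style a user would write, not the toolchain itself.

```julia
# Hypothetical example of a kernel as a domain expert might write it in
# plain Julia. Per the abstract, such a kernel would be compiled to
# synthesizable SystemVerilog as-is, without annotations or pragmas.
function fir(x::Vector{Int32}, h::Vector{Int32})
    y = zeros(Int32, length(x))
    for n in eachindex(x)
        acc = Int32(0)
        for k in eachindex(h)
            # Accumulate only taps that fall inside the input signal.
            if n - k + 1 >= 1
                acc += h[k] * x[n - k + 1]
            end
        end
        y[n] = acc
    end
    return y
end
```

Note that the kernel uses fixed-width integer types (`Int32`), which map naturally to hardware datapaths, and its loops and streaming access pattern are the sort of structure the toolchain's AXI4-Stream integration would presumably carry through to the generated RTL.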