🤖 AI Summary
To address fundamental challenges in integrating FPGAs into data centers (low abstraction levels, complex interfaces, and inefficient dynamic partial reconfiguration, or DPR), this paper introduces Coyote v2, an open-source shell for FPGA accelerators. The system is built on a novel three-layer hierarchical architecture that enables fine-grained, service- and user-logic-aware DPR. On top of this, it provides a unified logical interface, multithreading and multi-tenancy abstractions, a RoCE v2 network stack, an FPGA-GPU DMA engine, shared virtual memory with the host, and a high-level programming framework. Experimental evaluation demonstrates a 15-20% reduction in synthesis time and an order-of-magnitude reduction in runtime reconfiguration latency compared to existing systems. The system successfully deploys real-world applications, including HyperLogLog cardinality estimation, AES encryption, and neural network inference, with seamless invocation from Python. This work substantially improves FPGA programmability, reusability, and deployment efficiency in heterogeneous computing systems.
📝 Abstract
In the trend towards hardware specialization, FPGAs play a dual role: as accelerators for offloading tasks such as network virtualization, and as a vehicle for prototyping and exploring hardware designs. While FPGAs offer versatility and performance, integrating them into larger systems remains challenging. Recent efforts have therefore focused on raising the level of abstraction through better interfaces and high-level programming languages, yet considerable room for improvement remains. In this paper, we present Coyote v2, an open-source FPGA shell built around a novel three-layer hierarchical design that supports dynamic partial reconfiguration of both services and user logic, offers a unified logic interface, and provides high-level software abstractions such as support for multithreading and multi-tenancy. Experimental results indicate that Coyote v2 reduces synthesis times by between 15% and 20%, and runtime reconfiguration times by an order of magnitude, when compared to existing systems. We also demonstrate the advantages of Coyote v2 by deploying several realistic applications, including HyperLogLog cardinality estimation, AES encryption, and neural network inference. Finally, Coyote v2 places great emphasis on integration with real systems through reusable and reconfigurable services, including a fully RoCE v2-compliant networking stack, a shared virtual memory model with the host, and a DMA engine between FPGAs and GPUs. We demonstrate these features by, for example, seamlessly deploying an FPGA-accelerated neural network from Python.