🤖 AI Summary
This work addresses the challenge of deploying convolutional neural networks (CNNs) on edge devices, where existing MLIR-based high-level synthesis (HLS) frameworks struggle to simultaneously satisfy stringent hardware resource constraints and low-latency requirements. To bridge this gap, the authors propose an automated MLIR-driven HLS flow that integrates resource-aware optimization strategies tailored for edge deployment. By leveraging a streaming data architecture and fine-grained, resource-conscious buffer management, the approach enables efficient end-to-end mapping of CNNs onto FPGAs. Experimental results demonstrate that, compared to state-of-the-art frameworks, the proposed method achieves an average 15× speedup on standard CNN kernels with up to four layers, and up to 200× on single-layer kernels, while effectively supporting large input dimensions and adhering to the resource and energy-efficiency constraints typical of edge FPGA platforms.
📝 Abstract
Driven by the increasing demand for low-latency and real-time processing, machine learning applications are steadily migrating toward edge computing platforms, where Field-Programmable Gate Arrays (FPGAs) are widely adopted for their energy efficiency compared to CPUs and GPUs. To generate high-performance and low-power FPGA designs, several frameworks built upon High-Level Synthesis (HLS) vendor tools have been proposed, among which Multi-Level Intermediate Representation (MLIR)-based frameworks are gaining significant traction due to their extensibility and ease of use. However, existing state-of-the-art frameworks often overlook the stringent resource constraints of edge devices. To address this limitation, we propose MING, an MLIR-based framework that abstracts and automates the HLS design process. Within this framework, we adopt a streaming architecture with carefully managed buffers, specifically designed to respect resource constraints while ensuring low latency. In comparison with recent frameworks, our approach achieves on average a 15x speedup for standard Convolutional Neural Network (CNN) kernels with up to four layers, and up to a 200x speedup for single-layer kernels. For kernels with larger input sizes, MING generates efficient designs that respect hardware resource constraints that state-of-the-art frameworks struggle to meet.