🤖 AI Summary
To address the high overhead of collective communication and the inflexibility of switch functionality in high-performance computing (HPC), this paper proposes ACiS, the first framework to systematically define four in-network computing paradigms: scalar computation, user-defined type processing, stateful look-aside operations, and fused collective operations. This overcomes the limitation of conventional switches, which support only fixed collective primitives. Leveraging a coarse-grained reconfigurable architecture (CGRA), the authors design programmable hardware extensions for network switches and develop an end-to-end compilation toolchain that automatically maps MPI programs onto switch hardware with software–hardware co-scheduling. Experimental evaluation on real switch hardware demonstrates that all four paradigms are feasible and efficient: collective communication latency is reduced by up to 57%, host CPU utilization decreases significantly, and the optimization remains fully transparent to upper-layer applications.
📝 Abstract
For the last three decades, a core use of FPGAs has been communication processing: FPGA-based SmartNICs are in widespread use from the datacenter to IoT. Augmenting switches with FPGAs has been less studied, yet it offers numerous advantages that stem from moving processing from the edge of the network to the center. Communication switches have previously been augmented to process collectives, e.g., IBM BlueGene and Mellanox SHArP, but the support has been limited to a small set of predefined scalar operations and datatypes. Here we present ACiS, a framework and taxonomy for Advanced Computing in the Switch that unifies and expands our previous work in this area. In addition to fixed scalar collectives (Type 1), we propose three more types of in-switch application processing: (Type 2) user-defined operations and types, including data structures; (Type 3) look-aside operations that keep state within the operation and can contain loops; and (Type 4) fused collectives, built by fusing multiple existing collectives or by fusing collectives with map computations. ACiS is supported in hardware by modular switch extensions, including a coarse-grained reconfigurable architecture (CGRA). Software support for ACiS includes evaluation and translation of the relevant parts of user programs, compilation of user specifications into control flow graphs, and mapping of those graphs onto switch hardware. The overall goal is the transparent acceleration of HPC applications, encapsulated within an MPI implementation.
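To make the four-type taxonomy more concrete, the following is a minimal host-side Python sketch of what each type of in-switch processing might compute on the values arriving at a switch's ports. It is an illustration only, not the ACiS implementation or its API: the names `switch_reduce`, `Interval`, `merge_interval`, `RunningHistogram`, and `type4_sum_of_squares` are hypothetical, and the specific operations (interval union, key histogram, sum of squares) are assumed examples chosen for brevity.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def switch_reduce(op: Callable[[T, T], T], port_values: Iterable[T]) -> T:
    """Combine one value per ingress port into a single result.

    Hypothetical stand-in for a reduction performed inside a switch.
    """
    return reduce(op, port_values)


# Type 1: fixed scalar collective (predefined operation, predefined datatype).
def type1_sum(port_values: Iterable[int]) -> int:
    return switch_reduce(lambda a, b: a + b, port_values)


# Type 2: user-defined operation on a user-defined type (a small data structure).
@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float


def merge_interval(a: Interval, b: Interval) -> Interval:
    # Smallest interval that covers both inputs.
    return Interval(min(a.lo, b.lo), max(a.hi, b.hi))


# Type 3: look-aside operation that keeps state and contains a loop.
class RunningHistogram:
    def __init__(self) -> None:
        self.table: Dict[str, int] = {}  # state that persists across calls

    def process(self, keys: Iterable[str]) -> Dict[str, int]:
        for key in keys:  # loop inside the operation
            self.table[key] = self.table.get(key, 0) + 1
        return dict(self.table)


# Type 4: fused collective, here a map (square) fused with a reduction (sum).
def type4_sum_of_squares(port_values: Iterable[int]) -> int:
    return switch_reduce(lambda a, b: a + b, (v * v for v in port_values))
```

In this sketch, Types 1, 2, and 4 differ only in the operator and element type passed to the same reduction, whereas Type 3 needs storage that outlives a single call; that is one plausible reading of why the abstract treats stateful look-aside operations as a separate category.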