HE^2: A Communication-Light Heterogeneous Architecture for Efficient Fully Homomorphic Encryption

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the heavy computational and memory overheads of CKKS fully homomorphic encryption. Existing ASIC-based and near-memory-processing (NMP) accelerators suffer from high hardware cost and limited efficiency, and combining the two in a heterogeneous design introduces frequent, long-latency communication. To overcome these challenges, the authors propose HE², a communication-light xPU-xMU heterogeneous acceleration architecture that, according to the authors, is the first to systematically optimize ModUp and ModDown operations at the dataflow graph (DFG) level. HE² identifies and fuses parallel keyswitch blocks to reduce communication frequency, and it introduces a group-level pipelined execution mechanism that exploits parallelism across decomposed groups to hide communication latency. Compared to the state-of-the-art accelerator, HE² achieves a 1.66× end-to-end speedup and a 9.23× lower energy-delay-area product (EDAP), and it limits communication stalls to 6.67% of total execution latency.
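To make the idea of hiding communication behind computation concrete, the following is a minimal sketch of a generic two-stage pipeline model. It is not the HE² architecture: the function names, the per-group compute and transfer times, and the definition of "stall" (total time not covered by compute) are all assumptions chosen for illustration.

```python
def serial_latency(compute, comm):
    """Each group computes and then transfers; nothing overlaps."""
    return sum(c + t for c, t in zip(compute, comm))


def pipelined_latency(compute, comm):
    """Two-stage pipeline: a group's transfer overlaps the next group's compute.

    Assumes one compute unit and one link, each handling groups in order.
    """
    compute_done = 0.0   # time at which the compute unit finishes the current group
    transfer_done = 0.0  # time at which the link finishes the current group
    for c, t in zip(compute, comm):
        compute_done += c
        # A transfer starts once its data is ready and the link is free.
        transfer_done = max(transfer_done, compute_done) + t
    return transfer_done


def stall_fraction(compute, comm):
    """Share of pipelined latency not covered by compute (hypothetical metric)."""
    total = pipelined_latency(compute, comm)
    return (total - sum(compute)) / total


# Hypothetical per-group times in arbitrary units, not taken from the paper.
compute_times = [4.0, 4.0, 4.0]
comm_times = [2.0, 2.0, 2.0]
print(serial_latency(compute_times, comm_times))     # no overlap
print(pipelined_latency(compute_times, comm_times))  # transfers overlap later compute
print(stall_fraction(compute_times, comm_times))
```

In this toy model only the last group's transfer remains exposed when compute time per group exceeds transfer time, which is why more independent groups leave a smaller fraction of the latency as communication stall.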

📝 Abstract

CKKS, an emerging fully homomorphic encryption (FHE) scheme, has shown promise for privacy-preserving applications by enabling SIMD fixed-point computations on ciphertexts. Despite its strong security guarantees, CKKS involves both compute-intensive operators (ComOps) with high computational cost and memory-intensive operators (MemOps) with large memory footprints, so existing ASIC-based or NMP-based acceleration approaches suffer from high hardware overhead and limited efficiency. This observation motivates the integration of the architectural advantages of both paradigms into a heterogeneous xPU (ASIC)-xMU (NMP) architecture. However, in such a design, frequent and long-latency heterogeneous communication caused by the dominant keyswitch operator remains a key performance bottleneck. In this paper, we propose $HE^2$, a communication-light xPU-xMU heterogeneous FHE accelerator with dataflow graph (DFG) optimization and architecture co-design. First, we observe that the majority of communication arises at the interface between ModUp/ModDown and neighboring MemOps. To address this, we propose a DFG-level optimization framework that fully exploits the ModUp/ModDown reduction potential of the hoisting algorithm by identifying parallel keyswitch blocks and fusing them to reduce communication frequency. Second, we design an efficient heterogeneous architecture that adopts group-level pipelined execution to effectively hide communication latency by leveraging the inherent parallelism across decomposed groups. End-to-end evaluation results show that $HE^2$ achieves 1.66$\times$ speedup and 9.23$\times$ lower EDAP (Energy-Delay-Area Product) compared to the state-of-the-art accelerator, with communication stalls accounting for only 6.67% of the total latency.
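The ModUp/ModDown reduction that hoisting offers can be illustrated with a simple counting sketch. It assumes the commonly described setting in which several rotations are applied to the same input ciphertext and their results are summed, so one ModUp can be shared across the rotations and one ModDown can be applied after accumulation. The function name and flags are hypothetical, and the sketch does not model the paper's DFG framework or its fusion rules.

```python
def keyswitch_op_counts(num_rotations, hoist_modup=False, hoist_moddown=False):
    """Count ModUp/ModDown invocations for a block of parallel keyswitches.

    Assumption: all rotations share one input ciphertext and their outputs
    are accumulated, so each hoisted operation runs once for the whole block.
    """
    if num_rotations < 1:
        raise ValueError("num_rotations must be at least 1")
    modup = 1 if hoist_modup else num_rotations
    moddown = 1 if hoist_moddown else num_rotations
    return {"ModUp": modup, "ModDown": moddown}


# Hypothetical block of 8 rotations of the same ciphertext.
naive = keyswitch_op_counts(8)
fused = keyswitch_op_counts(8, hoist_modup=True, hoist_moddown=True)
print(naive)
print(fused)
```

Since the abstract attributes most communication to the interface between ModUp/ModDown and neighboring MemOps, fewer ModUp/ModDown invocations in a block would translate into fewer such boundary crossings under this assumption.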

Problem

Research questions and friction points this paper is trying to address.

Fully Homomorphic Encryption

Heterogeneous Architecture

Keyswitch Communication

CKKS

Communication Bottleneck

Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous Architecture

Fully Homomorphic Encryption

Communication Optimization