HE^2: A Communication-Light Heterogeneous Architecture for Efficient Fully Homomorphic Encryption

📅 2026-05-29
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the heavy computational and memory overheads of CKKS fully homomorphic encryption. Existing ASIC and near-memory acceleration approaches handle these overheads poorly: they incur high hardware cost and low efficiency, and combining them introduces severe heterogeneous communication latency. The authors propose HE², a communication-light xPU-xMU heterogeneous acceleration architecture that, for the first time, systematically optimizes ModUp and ModDown operations at the dataflow graph (DFG) level. HE² identifies and fuses parallel keyswitch blocks to reduce communication frequency, and introduces a group-level pipelined execution mechanism that exploits inter-group parallelism to hide communication latency. Compared to the state-of-the-art accelerator, HE² achieves a 1.66× end-to-end performance improvement and a 9.23× reduction in energy-delay-area product (EDAP), with communication stalls limited to 6.67% of total execution latency.
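To make the keyswitch-fusion idea concrete, here is a minimal sketch of the generic hoisting observation: keyswitch operations that consume the same ciphertext can share one ModUp. This is not the paper's DFG framework. The function name, the `(op, input)` tuple format, and the example operations are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_parallel_keyswitches(keyswitch_ops):
    """Group keyswitch ops by the ciphertext they consume.

    keyswitch_ops: list of (op_name, input_ciphertext) pairs (hypothetical format).
    Ops in the same group are candidates for sharing a single ModUp.
    """
    groups = defaultdict(list)
    for op_name, input_ct in keyswitch_ops:
        groups[input_ct].append(op_name)
    return dict(groups)


# Hypothetical workload: three rotations of ct_a and one rotation of ct_b.
ops = [("rot1", "ct_a"), ("rot2", "ct_a"), ("rot4", "ct_a"), ("rot1", "ct_b")]
groups = group_parallel_keyswitches(ops)

naive_modups = len(ops)      # one ModUp per keyswitch
fused_modups = len(groups)   # one ModUp per shared input (assumed hoisting model)
print(groups, naive_modups, fused_modups)
```

In this toy model every fused group also crosses the xPU-xMU boundary once instead of once per keyswitch, which is the intuition behind reducing communication frequency. The real savings depend on details the summary does not give.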
📝 Abstract
CKKS, an emerging fully homomorphic encryption (FHE) scheme, shows promise for privacy-preserving applications by enabling SIMD fixed-point computations on ciphertexts. Despite its strong security guarantees, CKKS involves both compute-intensive operators (ComOps) with high computational cost and memory-intensive operators (MemOps) with large memory footprints, so existing ASIC-based and near-memory processing (NMP)-based acceleration approaches suffer from high hardware overhead and limited efficiency. This observation motivates combining the architectural advantages of both paradigms in a heterogeneous xPU (ASIC)-xMU (NMP) architecture. However, in such a design, the frequent and long-latency heterogeneous communication caused by the dominant keyswitch operator remains a key performance bottleneck. In this paper, we propose $HE^2$, a communication-light xPU-xMU heterogeneous FHE accelerator with dataflow graph (DFG) optimization and architecture co-design. First, we observe that the majority of communication arises at the interface between ModUp/ModDown and neighboring MemOps. To address this, we propose a DFG-level optimization framework that fully exploits the ModUp/ModDown reduction potential of the hoisting algorithm by identifying parallel keyswitch blocks and fusing them to reduce communication frequency. Second, we design an efficient heterogeneous architecture that adopts group-level pipelined execution to hide communication latency by leveraging the inherent parallelism across decomposed groups. End-to-end evaluation results show that $HE^2$ achieves 1.66$\times$ speedup and 9.23$\times$ lower EDAP (Energy-Delay-Area Product) compared to the state-of-the-art accelerator, with communication stalls accounting for only 6.67% of the total latency.
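The latency-hiding effect of group-level pipelining can be illustrated with a toy two-stage timing model. This is a sketch under stated assumptions, not the paper's architecture: each decomposed group has one compute phase followed by one transfer, the link handles one transfer at a time, and the per-group times are made-up numbers.

```python
def serial_latency(compute, comm):
    """No overlap: each group's transfer must finish before the next group computes."""
    return sum(compute) + sum(comm)


def pipelined_latency(compute, comm):
    """Two-stage pipeline: a group's transfer overlaps the next group's compute."""
    compute_done = 0.0  # time at which the compute unit finishes the current group
    comm_done = 0.0     # time at which the link finishes the current transfer
    for c, t in zip(compute, comm):
        compute_done += c
        # A transfer starts once its data is ready and the link is free.
        comm_done = max(comm_done, compute_done) + t
    return comm_done


# Hypothetical per-group times in arbitrary units (not taken from the paper).
compute = [4.0, 4.0, 4.0, 4.0]
comm = [3.0, 3.0, 3.0, 3.0]

serial = serial_latency(compute, comm)
pipelined = pipelined_latency(compute, comm)
exposed_comm = pipelined - sum(compute)  # communication not hidden by compute
print(serial, pipelined, exposed_comm)
```

With these numbers only the last group's transfer stays exposed, because every earlier transfer runs while the next group is computing. When transfers take longer than compute, the link becomes the bottleneck and more communication time is exposed.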
Problem

Research questions and friction points this paper is trying to address.

Fully Homomorphic Encryption
Heterogeneous Architecture
Keyswitch Communication
CKKS
Communication Bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

heterogeneous architecture
fully homomorphic encryption
communication optimization
dataflow graph
keyswitch fusion
🔎 Similar Papers
Shangyi Shi
State Key Laboratory of Processors, Institute of Computing Technology, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Cambricon Technologies
Husheng Han
Institute of Computing Technology, Chinese Academy of Sciences
Computer architecture, Security, DNN, Domain-Specific Accelerator
Zhaoxuan Kan
State Key Laboratory of Processors, Institute of Computing Technology, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Cambricon Technologies
Yinghao Yang
State Key Laboratory of Processors, Institute of Computing Technology, CAS, Beijing, China
Jianan Mu
Institute of Computing Technology, State Key Laboratory of Processors (SKLP), CAS
Design Automation, Accelerator, Privacy Preserving Computing
Tenghui Hua
State Key Laboratory of Processors, Institute of Computing Technology, CAS, Beijing, China
Ge Yu
State Key Laboratory of Processors, Institute of Computing Technology, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; School of Advanced Interdisciplinary Sciences, CAS, Beijing, China
Xinyao Zheng
University of California Riverside
Ling Liang
pku.edu.cn
Zidong Du
State Key Laboratory of Processors, Institute of Computing Technology, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Xing Hu
Institute of Computing Technology, Chinese Academy of Sciences
micro-architecture, Deep learning architecture