🤖 AI Summary
This work addresses the high memory and hardware overhead of linear transformations in CKKS homomorphic encryption, which rely heavily on ciphertext rotations. The authors propose a triple-hoisted baby-step giant-step (BSGS) algorithm that substantially reduces the number of required rotations by further decomposing the baby step. The algorithm is partitioned into multiple phases to form a memory-optimized data path that cuts off-chip memory accesses, and it is implemented in an FPGA-based accelerator with a permutation circuit optimized for message routing. For a set of typical parameters, the design reduces off-chip memory access by 2.9× compared to the best prior design; synthesized for Xilinx Virtex UltraScale+ devices, it achieves a 5.8× reduction in computational latency compared with the baseline design.
📝 Abstract
Computations can be directly carried out over ciphertexts using homomorphic encryption (HE), which is indispensable for privacy-preserving cloud computing. Linear transformation is widely used in neural networks, including large language models. However, the implementation of linear transformation over HE requires a large number of ciphertext rotations, which incur significant memory and hardware overhead despite existing simplification techniques. This paper proposes a triple-hoisted baby-step giant-step algorithm that decomposes the baby step further to substantially reduce the number of ciphertext rotations needed for the CKKS HE evaluation of linear transformation. Moreover, to reduce off-chip memory access, which contributes to the majority of the latency, a memory-optimized data path is proposed by partitioning the algorithm into multiple phases. Furthermore, an efficient FPGA-based hardware accelerator with an optimized permutation circuit for message routing is designed for the proposed scheme. For a set of typical parameters, the proposed design reduces the off-chip memory access by 2.9x compared to the best prior design. Synthesized for Xilinx Virtex UltraScale+ devices, the proposed design achieves a 5.8x reduction in computational latency compared with the baseline design.
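To make the rotation-count argument concrete, the following is a minimal sketch of the standard single-level baby-step giant-step diagonal method for a matrix-vector product. It is not the triple-hoisted variant proposed in the paper, whose details the abstract does not give. Plain Python lists stand in for CKKS ciphertext slots, cyclic list shifts stand in for ciphertext rotations, and the names `rot`, `diagonal`, and `bsgs_matvec` are hypothetical helpers introduced only for illustration.

```python
def rot(x, k):
    """Cyclic left rotation by k slots: out[i] = x[(i + k) % n]."""
    n = len(x)
    k %= n  # Python's modulo maps negative k to the equivalent left shift
    return x[k:] + x[:k]


def diagonal(M, d):
    """d-th generalized diagonal of a square matrix: diag[i] = M[i][(i + d) % n]."""
    n = len(M)
    return [M[i][(i + d) % n] for i in range(n)]


def bsgs_matvec(M, v, n1):
    """Compute M @ v with the diagonal method, split into n1 baby steps
    and n2 = n / n1 giant steps. Returns (result, rotations), where
    rotations counts only shifts of the data vector (the stand-in for
    ciphertext rotations). Shifts of the matrix diagonals are plaintext
    preprocessing in an HE setting and are not counted."""
    n = len(M)
    if n % n1 != 0:
        raise ValueError("n1 must divide the dimension n")
    n2 = n // n1
    rotations = 0

    # Baby steps: rotate the input by 0, 1, ..., n1 - 1 once and reuse.
    baby = [list(v)]
    for b in range(1, n1):
        baby.append(rot(v, b))
        rotations += 1

    result = [0] * n
    for g in range(n2):
        inner = [0] * n
        for b in range(n1):
            # Pre-rotate the diagonal so that one giant-step rotation of
            # the accumulated sum realigns every term in this group.
            pre = rot(diagonal(M, g * n1 + b), -g * n1)
            inner = [acc + p * x for acc, p, x in zip(inner, pre, baby[b])]
        if g > 0:
            # Giant step: a single rotation per group of n1 diagonals.
            inner = rot(inner, g * n1)
            rotations += 1
        result = [r + t for r, t in zip(result, inner)]
    return result, rotations


if __name__ == "__main__":
    n = 16
    M = [[(3 * i + 5 * j + 1) % 7 for j in range(n)] for i in range(n)]
    v = list(range(1, n + 1))
    out, rotations = bsgs_matvec(M, v, n1=4)
    reference = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    print(out == reference, rotations)
```

In this sketch the data vector is rotated (n1 - 1) + (n2 - 1) times, compared with n - 1 rotations when each diagonal is handled separately. For n = 16 and n1 = 4 that is 6 rotations instead of 15, which is the kind of saving that further decomposition of the baby step aims to extend.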