🤖 AI Summary
To address the poor compatibility between homomorphic encryption (HE) and large language model (LLM) architectures, as well as the excessive computational overhead of private LLM inference, this paper proposes ENSI, a non-interactive, efficient, and secure inference framework. The approach integrates CKKS HE with the lightweight BitNet architecture through an optimized encoding strategy, which lowers the cost of encrypted matrix multiplication. To overcome the numerical instability and high HE cost of Softmax, it replaces Softmax with an HE-friendly sigmoid attention mechanism that requires no retraining. It also embeds bootstrapping into RMSNorm, reducing the proportion of bootstrapping to just 1%. Experiments show approximately an 8× speedup in encrypted matrix multiplication and a 2.6× speedup in the softmax (attention) computation on CPU over state-of-the-art baselines, offering a practical path toward scalable, privacy-preserving LLM deployment.
📝 Abstract
Secure inference enables privacy-preserving machine learning by leveraging cryptographic protocols that support computations on sensitive user data without exposing it. However, integrating cryptographic protocols with large language models (LLMs) presents significant challenges: the inherent complexity of these protocols, together with LLMs' massive parameter scale and sophisticated architectures, severely limits practical usability. In this work, we propose ENSI, a novel non-interactive secure inference framework for LLMs, based on the principle of co-designing the cryptographic protocols and the LLM architecture. ENSI employs an optimized encoding strategy that integrates the CKKS scheme with a lightweight LLM variant, BitNet, significantly reducing the computational complexity of encrypted matrix multiplications. In response to the prohibitive computational demands of softmax under homomorphic encryption (HE), we pioneer the integration of the sigmoid attention mechanism with HE as a seamless, retraining-free alternative. Furthermore, by embedding the bootstrapping operation within the RMSNorm process, we efficiently refresh ciphertexts while markedly decreasing the frequency of costly bootstrapping invocations. Experimental evaluations demonstrate that ENSI achieves approximately an 8x acceleration in matrix multiplications and a 2.6x speedup in softmax inference on CPU compared with state-of-the-art methods, while reducing the proportion of bootstrapping to just 1%.
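To make the softmax-versus-sigmoid contrast concrete, the following is a minimal plaintext NumPy sketch and not ENSI's encrypted implementation. The function names, the toy shapes, and the default bias of -log(n) (a choice borrowed from the general sigmoid-attention literature) are assumptions made for illustration. The point it shows is structural: softmax needs a row-wise maximum, an exponential, and a division, whereas sigmoid attention applies one elementwise function to each score, which is the kind of operation that is typically easier to approximate with polynomials under HE.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard scaled dot-product attention (plaintext reference)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Row-wise max, exp, and division: each row is normalized jointly.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def sigmoid_attention(q, k, v, bias=None):
    """Sigmoid attention: an elementwise sigmoid replaces the row-wise softmax.

    The default bias of -log(n) is an illustrative assumption, not a value
    taken from the ENSI paper.
    """
    n = k.shape[0]
    if bias is None:
        bias = -np.log(n)
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    # Each weight depends only on its own score; no row-wise normalization.
    weights = 1.0 / (1.0 + np.exp(-scores))
    return weights @ v

# Toy inputs: 4 tokens, head dimension 8 (hypothetical sizes).
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))

out_softmax = softmax_attention(q, k, v)
out_sigmoid = sigmoid_attention(q, k, v)
```

Both functions return an array of the same shape, so the sigmoid variant can stand in for the softmax variant at the call site. Whether model quality is preserved without retraining is the paper's empirical claim and is not something this sketch demonstrates.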