🤖 AI Summary
This work addresses two key bottlenecks in fully homomorphic encryption (FHE)-based inference for deep CNNs: the computational inefficiency of homomorphically evaluating nonlinear activation functions (e.g., ReLU, SiLU), and the ciphertext capacity constraint that hinders high-resolution image processing. To overcome these, we propose a synergistic optimization framework combining single-stage fine-tuning (SFT) and generalized interleaved packing (GIP). Our approach employs low-degree polynomial approximations to closely emulate activation functions and designs dedicated homomorphic operators around them. This enables, for the first time, end-to-end FHE-based object detection with the YOLO architecture, supporting feature maps of arbitrary resolution. Evaluated on CIFAR-10, ImageNet, and MS COCO, our method achieves accuracy comparable to plaintext baselines while significantly improving the efficiency and practicality of FHE inference. The proposed framework establishes a scalable paradigm for privacy-preserving visual computing.
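To make the activation-approximation idea concrete, here is a minimal plain-data sketch of fitting a low-degree polynomial to ReLU on a bounded interval. The interval, degree, and least-squares fitting procedure are illustrative assumptions, not the paper's actual method (which fine-tunes the network around the approximation):

```python
import numpy as np

# Hypothetical illustration: replace ReLU with a degree-2 polynomial
# fitted on an assumed input range [-5, 5]. FHE schemes evaluate
# additions and multiplications natively, so a low-degree polynomial
# is cheap to compute homomorphically, unlike the exact ReLU.
relu = lambda x: np.maximum(x, 0.0)

xs = np.linspace(-5.0, 5.0, 1001)           # assumed activation range
coeffs = np.polyfit(xs, relu(xs), deg=2)    # least-squares fit
poly_relu = np.poly1d(coeffs)

# Worst-case deviation from the true ReLU on the fitting interval;
# fine-tuning (as in SFT) is what recovers the accuracy this costs.
max_err = np.max(np.abs(poly_relu(xs) - relu(xs)))
```

By symmetry the fitted linear coefficient is 0.5 (ReLU(x) = x/2 + |x|/2), and the residual error is concentrated near the origin; higher degrees shrink it at the cost of more ciphertext multiplications.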
📝 Abstract
We address two fundamental challenges in adapting general deep CNNs for FHE-based inference: approximating non-linear activations such as ReLU with low-degree polynomials while minimizing accuracy degradation, and overcoming the ciphertext capacity barrier that constrains high-resolution image processing during FHE inference. Our contributions are twofold: (1) a single-stage fine-tuning (SFT) strategy that directly converts pre-trained CNNs into FHE-friendly forms using low-degree polynomials, achieving competitive accuracy with minimal training overhead; and (2) a generalized interleaved packing (GIP) scheme compatible with feature maps of virtually arbitrary spatial resolution, accompanied by a suite of carefully designed homomorphic operators that preserve the GIP-form encryption throughout computation. Together, these advances enable efficient, end-to-end FHE inference across diverse CNN architectures. Experiments on CIFAR-10, ImageNet, and MS COCO demonstrate that FHE-friendly CNNs obtained via our SFT strategy achieve accuracy comparable to baselines using ReLU or SiLU activations. Moreover, this work presents the first demonstration of FHE-based inference for YOLO architectures in object detection using low-degree polynomial activations.
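To illustrate what an interleaved packing layout looks like, here is a plain-data sketch. The slot mapping below is a hypothetical example, not the paper's actual GIP scheme: it simply interleaves channels so that all channels of one pixel occupy adjacent slots, which keeps a fixed, stride-based layout regardless of the spatial resolution:

```python
import numpy as np

# Plain-data sketch of an interleaved packing layout (assumed mapping,
# not the paper's GIP): channel c of pixel (i, j) lands at slot
# (i * W + j) * C + c, so slots interleave channels pixel by pixel
# instead of storing each channel contiguously.
H, W, C = 4, 4, 3
fmap = np.arange(H * W * C).reshape(C, H, W)   # channel-major feature map

packed = np.transpose(fmap, (1, 2, 0)).reshape(-1)  # interleave channels

# Any single channel is recoverable by a fixed stride over the slots,
# which is the kind of regular access pattern homomorphic rotations
# and masks can exploit.
chan1 = packed[1::C].reshape(H, W)
```

Because the stride depends only on C, the same layout applies to any H and W, which mirrors why an interleaved scheme can support feature maps of arbitrary spatial resolution.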