FuseFSS: Efficient Secure LLM Inference with Function Secret Sharing

📅 2026-06-08
🤖 AI Summary
In two-server secure inference, fixed-point nonlinear operations have become a performance bottleneck because each one relies on a custom protocol. This work proposes FuseFSS, a unified compilation framework that transforms diverse scalar fixed-point operators (via interval partitioning, low-degree piecewise polynomial approximation, and predicate-bit extraction) into batched Function Secret Sharing (FSS) evaluations, eliminating the need for operator-specific protocol design. Relative to the current state-of-the-art FSS-based GPU secure inference, the approach reduces both communication and preprocessing overhead: it achieves end-to-end speedups of 1.24–1.50× on BERT- and GPT-like models, decreases online communication by 9%–16%, shortens key generation time by 14%–23%, and reduces key size by 20%–24%, all while preserving model accuracy.
📝 Abstract
Two-server secure inference allows a client to query a hosted large language model (LLM) without revealing prompts or embeddings. Recent GPU systems based on function secret sharing (FSS) make linear layers efficient, but fixed-point nonlinearities and helper operations remain a bottleneck because each operator is typically implemented as a bespoke protocol with its own comparisons, wrap-around corrections, and preprocessing material. We present FuseFSS, a compiler that replaces per-operator protocol design with a single compilation pipeline. For each scalar fixed-point operator, a compact specification lists its interval partition, low-degree arithmetic pieces, and required predicate bits. The compiler emits two batched FSS evaluations on the public masked value: one packed comparison that returns all predicate bits, and one vector interval lookup that returns the active coefficients and constants. Compared to the current state-of-the-art FSS-based GPU secure inference, FuseFSS preserves accuracy while achieving a 1.24×–1.50× end-to-end speedup and reducing online communication by 9%–16% on BERT and GPT-style models; preprocessing is also lighter, with 14%–23% lower key-generation time and 20%–24% smaller keys.
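To make the notion of a "compact specification" concrete, the following is a minimal plaintext sketch in Python, not the paper's implementation. It contains no secret sharing, masking, or FSS keys; it only illustrates the two quantities the abstract says the compiled evaluations return (predicate bits from a comparison, and the active coefficients from an interval lookup) for a fixed-point ReLU. The names `OperatorSpec`, `predicate_bits`, `interval_lookup`, `evaluate`, and the scale `FRAC_BITS` are all hypothetical and chosen for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass

FRAC_BITS = 12  # assumed fixed-point scale; not taken from the paper


@dataclass(frozen=True)
class OperatorSpec:
    """Hypothetical plaintext stand-in for a per-operator specification."""
    cutpoints: tuple   # sorted fixed-point interval boundaries
    pieces: tuple      # len(cutpoints) + 1 coefficient tuples, lowest degree first
    thresholds: tuple  # predicate bit i is (x < thresholds[i])


def to_fixed(v: float) -> int:
    return int(round(v * (1 << FRAC_BITS)))


def predicate_bits(spec: OperatorSpec, x: int) -> list:
    # Plaintext analogue of the packed comparison: one bit per threshold.
    return [int(x < t) for t in spec.thresholds]


def interval_lookup(spec: OperatorSpec, x: int) -> tuple:
    # Plaintext analogue of the interval lookup: coefficients of the
    # piece whose interval contains x.
    return spec.pieces[bisect_right(spec.cutpoints, x)]


def evaluate(spec: OperatorSpec, x: int) -> int:
    # Horner evaluation of the active low-degree piece in fixed point.
    acc = 0
    for c in reversed(interval_lookup(spec, x)):
        acc = ((acc * x) >> FRAC_BITS) + c
    return acc


# ReLU as two pieces: 0 on x < 0, and x on x >= 0.
RELU = OperatorSpec(
    cutpoints=(0,),
    pieces=((0,), (0, to_fixed(1.0))),
    thresholds=(0,),
)
```

For example, `evaluate(RELU, to_fixed(1.5))` returns `to_fixed(1.5)` and `evaluate(RELU, to_fixed(-2.0))` returns `0`. In the secure setting described in the abstract, both lookups would instead be carried out as batched FSS evaluations on a public masked value, which this sketch deliberately omits.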
Problem

Research questions and friction points this paper is trying to address.

secure inference
function secret sharing
large language models
fixed-point nonlinearities
two-server computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Function Secret Sharing
Secure LLM Inference
Compiler Framework
Fixed-point Nonlinearities
Two-server Protocol