AERO: Softmax-Only LLMs for Efficient Private Inference

📅 2024-10-16
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
To address the high communication overhead and latency that non-linear operations impose on Transformers run over encrypted inputs for privacy-preserving inference (PI), this paper proposes the first Softmax-only lightweight LLM architecture. The method eliminates all other non-linear modules, including LayerNorm and GELU, retaining Softmax as the sole non-linearity. To mitigate the entropy collapse in deep layers and entropy overload in shallow layers caused by the absence of these non-linearities, the authors introduce an entropy regularization mechanism, and the architecture is adapted end-to-end for private inference. The resulting model preserves expressivity while substantially reducing both computational and communication complexity. Experiments show that, compared with state-of-the-art approaches, the method reduces communication overhead by 4.23× and inference latency by 1.94×, with consistent effectiveness across multiple benchmarks.
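To make the "Softmax as the sole non-linearity" idea concrete, here is a minimal sketch of one attention head with no LayerNorm on the input and no GELU downstream. This is an illustrative toy in plain Python, not the paper's actual implementation; all function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    # Plain list-of-lists matrix multiply.
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def softmax_only_attention(X, Wq, Wk, Wv):
    """One attention head where Softmax is the only non-linear step:
    no LayerNorm on the input, no GELU anywhere in the block."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Wq[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in K]
              for qi in Q]
    attn = [softmax(row) for row in scores]  # the sole non-linearity
    return matmul(attn, V)
```

Every other operation above is a linear map, which is exactly what keeps the encrypted-domain cost low: under most PI protocols, linear layers are cheap while each non-linear evaluation incurs heavy communication.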

📝 Abstract
The pervasiveness of proprietary language models has raised privacy concerns for users' sensitive data, emphasizing the need for private inference (PI), where inference is performed directly on encrypted inputs. However, current PI methods face prohibitively high communication and latency overheads, primarily due to nonlinear operations. In this paper, we present a comprehensive analysis to understand the role of nonlinearities in transformer-based decoder-only language models. We introduce AERO, a four-step architectural optimization framework that refines the existing LLM architecture for efficient PI by systematically removing nonlinearities such as LayerNorm and GELU and reducing FLOP counts. For the first time, we propose a Softmax-only architecture with significantly fewer FLOPs, tailored for efficient PI. Furthermore, we devise a novel entropy regularization technique to improve the performance of Softmax-only models. AERO achieves up to 4.23× communication and 1.94× latency reduction. We validate the effectiveness of AERO by benchmarking it against the state-of-the-art.
Problem

Research questions and friction points this paper is trying to address.

Addresses prohibitive latency in private LLM inference
Solves entropy collapse and overload in transformer layers
Strategically eliminates costly nonlinear operations
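The two entropy failure modes named above can be made concrete with the Shannon entropy of an attention row: "overload" means near-uniform attention (entropy near its log(n) maximum, little selectivity), while "collapse" means near-one-hot attention (entropy near zero, no context mixing). A small illustrative example:

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of an attention distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Entropy overload: near-uniform attention over 4 tokens, entropy ~ log(4).
overloaded = [0.25, 0.25, 0.25, 0.25]
# Entropy collapse: near-one-hot attention, entropy ~ 0.
collapsed = [0.997, 0.001, 0.001, 0.001]

max_entropy = math.log(4)  # upper bound for a 4-token context
```

The paper attributes both extremes to removing the other non-linearities; its regularizer is designed to keep heads away from either end of this range.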
Innovation

Methods, ideas, or system contributions that make the work stand out.

AERO framework strategically removes transformer nonlinear operations
Employs head-wise entropy regularizer with learnable strengths
Adaptively recalibrates attention heads to prevent entropy extremes
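The exact form of AERO's head-wise entropy regularizer is specified in the paper; the sketch below only illustrates the stated idea, assuming (hypothetically) a squared deviation of each head's mean attention entropy from a target, weighted by a learnable per-head strength.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of an attention distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def entropy_reg_loss(attn_rows_per_head, target_entropy, strengths):
    """Hypothetical head-wise entropy penalty: head h with learnable
    strength gamma_h is pushed toward target_entropy, discouraging both
    collapse (entropy -> 0) and overload (entropy -> log n)."""
    loss = 0.0
    for head_rows, gamma in zip(attn_rows_per_head, strengths):
        mean_ent = sum(entropy(r) for r in head_rows) / len(head_rows)
        loss += gamma * (mean_ent - target_entropy) ** 2
    return loss
```

During training this term would be added to the task loss, with the per-head strengths letting the model decide how tightly each head's entropy is constrained.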