Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs

📅 2025-09-23

🤖 AI Summary

Problem: Deploying large language models (LLMs) in privacy-sensitive domains such as healthcare and finance requires end-to-end secure inference, yet trusted execution environments (TEEs) have not been systematically evaluated for large-scale LLM workloads. Method: This work evaluates the feasibility and practicality of running full inference pipelines for Llama2 models (7B, 13B, 70B) entirely inside CPU TEEs (Intel TDX and SGX, accelerated by Advanced Matrix Extensions, AMX) and GPU TEEs (NVIDIA H100 Confidential Compute GPUs). Contribution/Results: Across data types, batch sizes, and input lengths, CPU TEEs impose under 10% throughput overhead and under 20% latency overhead, both reduced further by AMX. GPU TEEs show throughput penalties of 4–8% that diminish as batch and input sizes grow. Comparing performance, cost, and security trade-offs, the authors show that CPU TEEs can be more cost-effective or more secure than their GPU counterparts. The authors describe the study as, to their knowledge, the first to comprehensively demonstrate the practicality of modern TEEs across both CPUs and GPUs for confidential LLM inference on converged cloud and HPC infrastructure.

📝 Abstract

Large Language Models (LLMs) are increasingly deployed on converged Cloud and High-Performance Computing (HPC) infrastructure. However, as LLMs handle confidential inputs and are fine-tuned on costly, proprietary datasets, their heightened security requirements slow adoption in privacy-sensitive sectors such as healthcare and finance. We investigate methods to address this gap and propose Trusted Execution Environments (TEEs) as a solution for securing end-to-end LLM inference. We validate their practicality by evaluating these compute-intensive workloads entirely within CPU and GPU TEEs. On the CPU side, we conduct an in-depth study running full Llama2 inference pipelines (7B, 13B, 70B) inside Intel's TDX and SGX, accelerated by Advanced Matrix Extensions (AMX). We derive 12 insights, including that across various data types, batch sizes, and input lengths, CPU TEEs impose under 10% throughput and 20% latency overheads, further reduced by AMX. We run LLM inference on NVIDIA H100 Confidential Compute GPUs, contextualizing our CPU findings and observing throughput penalties of 4-8% that diminish as batch and input sizes grow. By comparing performance, cost, and security trade-offs, we show how CPU TEEs can be more cost-effective or secure than their GPU counterparts. To our knowledge, our work is the first to comprehensively demonstrate the performance and practicality of modern TEEs across both CPUs and GPUs for enabling confidential LLMs (cLLMs).
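The overhead percentages quoted in the abstract are relative to a non-confidential baseline. The following is a minimal sketch of how such a figure could be computed, assuming overhead is defined as the relative change against the baseline run; the function name `relative_overhead` and all measurement values are hypothetical placeholders, not the paper's tooling or results.

```python
def relative_overhead(baseline: float, confidential: float,
                      higher_is_better: bool) -> float:
    """Return the TEE overhead as a fraction of the baseline measurement.

    For throughput (higher is better) the overhead is the relative drop;
    for latency (lower is better) it is the relative increase.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    if higher_is_better:
        return (baseline - confidential) / baseline
    return (confidential - baseline) / baseline


# Hypothetical placeholder measurements, NOT results from the paper.
throughput_overhead = relative_overhead(100.0, 92.0, higher_is_better=True)   # tokens/s
latency_overhead = relative_overhead(50.0, 58.0, higher_is_better=False)      # ms/token

print(f"throughput overhead: {throughput_overhead:.1%}")
print(f"latency overhead: {latency_overhead:.1%}")
```

With these placeholder inputs, the throughput overhead falls under the 10% bound and the latency overhead under the 20% bound that the abstract reports for CPU TEEs.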

Problem

Research questions and friction points this paper is trying to address.

Securing confidential LLM inference in privacy-sensitive sectors

Evaluating performance overhead of TEEs on CPU and GPU platforms

Comparing cost-effectiveness and security trade-offs between CPU and GPU TEEs
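One common way to put CPU and GPU deployments on the same cost scale is to normalize instance price by sustained throughput. The sketch below assumes that metric (cost per million generated tokens); the summary does not state which cost metric the paper uses, and the function name, prices, and throughputs are hypothetical placeholders rather than figures from the paper.

```python
def cost_per_million_tokens(hourly_price_usd: float,
                            tokens_per_second: float) -> float:
    """Convert an hourly instance price and sustained throughput
    into US dollars per one million generated tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


# Hypothetical placeholder inputs, NOT prices or throughputs from the paper.
instance_a = cost_per_million_tokens(hourly_price_usd=3.6, tokens_per_second=100.0)
instance_b = cost_per_million_tokens(hourly_price_usd=7.2, tokens_per_second=400.0)

print(f"instance A: ${instance_a:.2f} per million tokens")
print(f"instance B: ${instance_b:.2f} per million tokens")
```

The TEE throughput, measured with confidential mode enabled, is the value to pass in, so any TEE overhead is already reflected in the resulting cost.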

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Trusted Execution Environments for secure LLM inference

Evaluates CPU TEEs with Intel TDX and SGX

Assesses GPU TEEs on NVIDIA H100 Confidential Compute
