Experience Deploying Containerized GenAI Services at an HPC Center

📅 2025-09-24
🤖 AI Summary
High-performance computing (HPC) centers face challenges in supporting containerized generative AI (GenAI) services and ensuring cross-platform reproducibility. Method: This paper proposes a unified architecture integrating HPC and cloud-native technologies, orchestrated via Kubernetes and incorporating the vLLM inference server, multiple container runtimes (e.g., Singularity and CRI-O), object storage, and vector databases, to enable seamless deployment and coordinated execution of GenAI components across heterogeneous HPC environments. Contribution/Results: The approach breaks down the traditional isolation between HPC and cloud-native ecosystems, enabling high-fidelity, cross-platform reproducibility of containerized AI workloads. Evaluated on Llama-series models, the system demonstrates strong stability, inference throughput, and deployment consistency relative to pure-HPC or pure-cloud alternatives. It establishes a reusable deployment paradigm for GenAI services, significantly enhancing HPC centers' capability to support large-model inference workloads.

📝 Abstract
Generative Artificial Intelligence (GenAI) applications are built from specialized components -- inference servers, object storage, vector and graph databases, and user interfaces -- interconnected via web-based APIs. While these components are often containerized and deployed in cloud environments, such capabilities are still emerging at High-Performance Computing (HPC) centers. In this paper, we share our experience deploying GenAI workloads within an established HPC center, discussing the integration of HPC and cloud computing environments. We describe our converged computing architecture that integrates HPC and Kubernetes platforms running containerized GenAI workloads, helping with reproducibility. A case study illustrates the deployment of the Llama Large Language Model (LLM) using a containerized inference server (vLLM) across both Kubernetes and HPC platforms using multiple container runtimes. Our experience highlights practical considerations and opportunities for the HPC container community, guiding future research and tool development.
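The case study deploys the same vLLM inference server image under both a cloud-native container runtime and Singularity/Apptainer on HPC compute nodes. A minimal sketch of what such a launch can look like (the image tag, model name, and port below are illustrative assumptions, not values taken from the paper):

```shell
# Cloud-native side: run vLLM's OpenAI-compatible server under a Docker-style
# runtime, exposing the HTTP API on port 8000 and passing the GPUs through.
docker run --gpus all -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-8B-Instruct

# HPC side: the same OCI image launched via Apptainer/Singularity on a compute
# node (typically inside a batch allocation); --nv binds the node's NVIDIA
# driver stack into the container.
apptainer run --nv docker://vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-8B-Instruct --port 8000
```

Running one image across both runtimes is what makes the cross-platform reproducibility claim testable: the server binary, model weights, and dependencies are identical on each side, and only the launch mechanism differs.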
Problem

Research questions and friction points this paper is trying to address.

Deploying containerized GenAI services in HPC environments
Integrating HPC and Kubernetes platforms for AI workloads
Running containerized inference servers across different platforms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated HPC and Kubernetes architecture
Containerized inference servers across platforms
Multiple container runtimes for reproducibility
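Because vLLM exposes an OpenAI-compatible HTTP API, the same client request works no matter which platform or runtime hosts the server, which is one way an architecture like this keeps deployments consistent. A sketch of such a request (host, port, and model name are assumptions):

```shell
# Query the OpenAI-compatible /v1/completions endpoint served by vLLM.
# The endpoint path is part of vLLM's standard API; the address is assumed.
curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
         "prompt": "What is MPI?",
         "max_tokens": 32}'
```

A client written against this API is unaffected by whether the server behind it was scheduled by Kubernetes or by an HPC batch system.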
Angel M. Beltre
Sandia National Laboratories, Scalable System Software, Albuquerque, New Mexico, USA
Jeff Ogden
Sandia National Laboratories, HPC Systems, Albuquerque, New Mexico, USA
Kevin Pedretti
Sandia National Laboratories
High Performance Computing - Operating Systems - Networking