Experience Deploying Containerized GenAI Services at an HPC Center

📅 2025-09-24
🤖 AI Summary
High-performance computing (HPC) centers face challenges in supporting containerized generative AI (GenAI) services and ensuring cross-platform reproducibility. Method: This paper proposes a unified architecture integrating HPC and cloud-native technologies, orchestrated via Kubernetes and incorporating the vLLM inference server, multiple container runtimes (e.g., Singularity and CRI-O), object storage, and vector databases, to enable seamless deployment and coordinated execution of GenAI components across heterogeneous HPC environments. Contribution/Results: The approach breaks down the traditional isolation between HPC and cloud-native ecosystems, enabling high-fidelity, cross-platform reproducibility of containerized AI workloads. Evaluated on Llama-series models, the system demonstrates strong stability, inference throughput, and deployment consistency relative to pure-HPC or pure-cloud alternatives. It establishes a reusable deployment paradigm for GenAI services, significantly enhancing HPC centers' capability to support large-model inference workloads.

📝 Abstract
Generative Artificial Intelligence (GenAI) applications are built from specialized components -- inference servers, object storage, vector and graph databases, and user interfaces -- interconnected via web-based APIs. While these components are often containerized and deployed in cloud environments, such capabilities are still emerging at High-Performance Computing (HPC) centers. In this paper, we share our experience deploying GenAI workloads within an established HPC center, discussing the integration of HPC and cloud computing environments. We describe our converged computing architecture that integrates HPC and Kubernetes platforms running containerized GenAI workloads, helping with reproducibility. A case study illustrates the deployment of the Llama Large Language Model (LLM) using a containerized inference server (vLLM) across both Kubernetes and HPC platforms using multiple container runtimes. Our experience highlights practical considerations and opportunities for the HPC container community, guiding future research and tool development.
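The case study deploys the same vLLM inference server image under both a cloud-native container runtime and Singularity/Apptainer on HPC compute nodes. A minimal sketch of what such a launch can look like (the image tag, model name, and port below are illustrative assumptions, not values taken from the paper):

```shell
# Cloud-native side: run vLLM's OpenAI-compatible server under a Docker-style
# runtime, exposing the HTTP API on port 8000 and passing the GPUs through.
docker run --gpus all -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-8B-Instruct

# HPC side: the same OCI image launched via Apptainer/Singularity on a compute
# node (typically inside a batch allocation); --nv binds the node's NVIDIA
# driver stack into the container.
apptainer run --nv docker://vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-8B-Instruct --port 8000
```

Running one image across both runtimes is what makes the cross-platform reproducibility claim testable: the server binary, model weights, and dependencies are identical on each side, and only the launch mechanism differs.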
Problem

Research questions and friction points this paper is trying to address.

Deploying containerized GenAI services in HPC environments
Integrating HPC and Kubernetes platforms for AI workloads
Running containerized inference servers across different platforms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated HPC and Kubernetes architecture
Containerized inference servers across platforms
Multiple container runtimes for reproducibility
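Because vLLM exposes an OpenAI-compatible HTTP API, the same client request works no matter which platform or runtime hosts the server, which is one way an architecture like this keeps deployments consistent. A sketch of such a request (host, port, and model name are assumptions):

```shell
# Query the OpenAI-compatible /v1/completions endpoint served by vLLM.
# The endpoint path is part of vLLM's standard API; the address is assumed.
curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
         "prompt": "What is MPI?",
         "max_tokens": 32}'
```

A client written against this API is unaffected by whether the server behind it was scheduled by Kubernetes or by an HPC batch system.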
Angel M. Beltre
Sandia National Laboratories, Scalable System Software, Albuquerque, New Mexico, USA
Jeff Ogden
Sandia National Laboratories, HPC Systems, Albuquerque, New Mexico, USA
Kevin Pedretti
Sandia National Laboratories
High Performance Computing - Operating Systems - Networking