🤖 AI Summary
Large language model (LLM) inference in heterogeneous GPU clusters must contend with two intertwined challenges: hardware resource heterogeneity and stringent low-latency requirements.
Method: This paper proposes a unified framework that jointly formulates model placement and request scheduling as a max-flow problem on a directed, weighted graph, solved globally via mixed-integer linear programming (MILP). Crucially, GPU instances serve as graph nodes, while edge capacities capture both network bandwidth and per-GPU compute capacity, enabling resource-aware scheduling.
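To make the graph formulation concrete, here is a minimal, self-contained sketch (not Helix's actual MILP solver) that computes the max flow over a toy heterogeneous cluster graph using the Edmonds-Karp algorithm. The node names and capacity numbers are hypothetical, chosen only to illustrate how GPU and network limits become edge capacities bounding end-to-end serving throughput.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp: repeatedly augment along shortest s-t paths found by BFS."""
    flow = 0
    # Residual capacities, copied so the input graph stays untouched.
    res = {u: dict(vs) for u, vs in cap.items()}
    # Make sure every edge has a reverse edge in the residual graph.
    for u in list(res):
        for v in list(res[u]):
            res.setdefault(v, {}).setdefault(u, 0)
    while True:
        # BFS for an s-t path with positive residual capacity.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow  # no augmenting path left: flow is maximal
        # Recover the path, find its bottleneck capacity, and augment.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        flow += bottleneck

# Hypothetical 3-GPU cluster: capacities (tokens/s) stand in for per-GPU
# compute and inter-GPU network bandwidth -- all numbers are made up.
cluster = {
    "src":  {"A100": 100, "T4": 30},
    "A100": {"L4": 60, "sink": 50},
    "T4":   {"L4": 20},
    "L4":   {"sink": 70},
}
print(max_flow(cluster, "src", "sink"))  # -> 120, the achievable throughput
```

In this toy instance the min cut is the pair of edges into the sink (50 + 70 = 120), so no placement can serve more than 120 tokens/s; Helix's MILP additionally decides which model layers each node hosts, which this sketch omits.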
Contribution/Results: The joint formulation avoids the performance bottlenecks of conventional strategies that decouple placement from scheduling. Evaluated on real heterogeneous clusters of 24–42 GPU nodes, Helix achieves up to 3.3× higher serving throughput, up to a 66% reduction in prompting (time-to-first-token) latency, and up to a 24% reduction in decoding latency.
📝 Abstract
This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving in heterogeneous GPU clusters. The key idea behind Helix is to formulate inference computation of LLMs over heterogeneous GPUs and network connections as a max-flow problem on directed, weighted graphs, whose nodes represent GPU instances and edges capture both GPU and network heterogeneity through their capacities. Helix then uses a mixed integer linear programming (MILP) algorithm to discover highly optimized strategies to serve LLMs on heterogeneous GPUs. This approach allows Helix to jointly optimize model placement and request scheduling, two highly entangled tasks in heterogeneous LLM serving. Our evaluation on several heterogeneous clusters ranging from 24 to 42 GPU nodes shows that Helix improves serving throughput by up to 3.3x and reduces prompting and decoding latency by up to 66% and 24%, respectively, compared to existing approaches. Helix is available at https://github.com/Thesys-lab/Helix-ASPLOS25.