Context Parallelism for Scalable Million-Token Inference

📅 2024-11-04

🏛️ arXiv.org

📈 Citations: 5

✨ Influential: 1

🤖 AI Summary

Prefill latency grows super-linearly with context length, which makes million-token contexts prohibitively slow in large language model inference. This paper introduces context parallelism together with two lossless, exact ring attention variants (pass-KV and pass-Q) that cover full prefill, persistent KV prefill, and decode. The method partitions the context across GPUs and nodes, scales to 128 H100 GPUs across 16 nodes, and is benchmarked on hosts interconnected with RDMA and with TCP. On Llama3 405B it achieves 1M-token prefill in 77 seconds (93% parallelization efficiency, 63% FLOPS utilization) and 128K-token prefill in 3.8 seconds. Scalability is similar on both interconnects, which indicates that the method works in common commercial data centers with medium-to-low inter-host bandwidth. The core contribution is a long-context inference approach that combines exact attention, coverage of all three inference modes, and near-linear scaling of prefill latency.

📝 Abstract

We present context parallelism for long-context large language model inference, which achieves near-linear scaling for long-context prefill latency with up to 128 H100 GPUs across 16 nodes. In particular, our method achieves 1M context prefill with the Llama3 405B model in 77s (93% parallelization efficiency, 63% FLOPS utilization) and 128K context prefill in 3.8s. We develop two lossless exact ring attention variants, pass-KV and pass-Q, to cover a wide range of use cases with state-of-the-art performance: full prefill, persistent KV prefill, and decode. Benchmarks on H100 GPU hosts interconnected with RDMA and with TCP both show similar scalability for long-context prefill, demonstrating that our method scales well in common commercial data centers with medium-to-low inter-host bandwidth.
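To make the "lossless exact" claim concrete, here is a minimal single-process sketch of the pass-KV idea in NumPy. It is an illustration under stated assumptions, not the paper's implementation: the ring is simulated with a Python list instead of real GPU communication, attention is non-causal and single-head for brevity, and the function names (`full_attention`, `ring_attention_pass_kv`) are hypothetical. Each simulated rank keeps its query shard fixed, receives one key/value shard per step, and merges partial results with a running (online) softmax, so the final output matches ordinary attention up to floating-point rounding.

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: plain softmax attention over the whole sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

def ring_attention_pass_kv(q, k, v, world_size):
    """Simulate pass-KV ring attention on one process (non-causal, assumed)."""
    d = q.shape[-1]
    q_shards = np.array_split(q, world_size)
    # kv_held[r] is the (K, V) shard currently resident on simulated rank r.
    kv_held = list(zip(np.array_split(k, world_size),
                       np.array_split(v, world_size)))

    # Per-rank running state for the online softmax.
    row_max = [np.full((qs.shape[0], 1), -np.inf) for qs in q_shards]
    denom = [np.zeros((qs.shape[0], 1)) for qs in q_shards]
    numer = [np.zeros((qs.shape[0], v.shape[-1])) for qs in q_shards]

    for _ in range(world_size):
        for r in range(world_size):
            k_blk, v_blk = kv_held[r]
            scores = q_shards[r] @ k_blk.T / np.sqrt(d)
            new_max = np.maximum(row_max[r], scores.max(axis=-1, keepdims=True))
            scale = np.exp(row_max[r] - new_max)  # rescale earlier partial sums
            p = np.exp(scores - new_max)
            denom[r] = denom[r] * scale + p.sum(axis=-1, keepdims=True)
            numer[r] = numer[r] * scale + p @ v_blk
            row_max[r] = new_max
        # "Send" every KV shard to the next rank in the ring.
        kv_held = kv_held[-1:] + kv_held[:-1]

    # Normalize each rank's accumulator and stitch the query shards together.
    return np.concatenate([n / s for n, s in zip(numer, denom)])

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out_ring = ring_attention_pass_kv(q, k, v, world_size=4)
out_full = full_attention(q, k, v)
print(np.allclose(out_ring, out_full))
```

In this sketch the keys and values travel around the ring while the queries stay put. A pass-Q variant would, by analogy, circulate query shards and keep the keys and values resident; how the paper chooses between the two for each inference mode is not covered here.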

Problem

Research questions and friction points this paper is trying to address.

Enables million-token inference with near-linear scaling

Reduces long-context prefill latency using 128 GPUs

Supports diverse use cases with lossless exact attention

Innovation

Methods, ideas, or system contributions that make the work stand out.

Context parallelism for million-token inference

Lossless exact ring attention variants

Scalable across 128 GPUs with high efficiency

🔎 Similar Papers

Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations