Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations

📅 2024-09-25
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Large language models (LLMs) face significant challenges when serving interactive inference over ultra-long contexts (up to 10M tokens): high time-to-first-token (TTFT) and time-between-tokens (TBT), head-of-line blocking, and low hardware utilization. Method: the paper proposes a lossless long-context inference system built on a 3D parallelism strategy that combines adaptive chunking (to reduce prefill overheads in mixed batching), Sequence Pipeline Parallelism (SPP, to lower TTFT), and KV Cache Parallelism (KVP, to minimize TBT). Contribution/Results: Mnemosyne scales interactive inference to contexts of at least 10 million tokens with high batched throughput while meeting a production-grade TBT SLO of 30 ms, all without approximation, compression, or accuracy loss. To the authors' knowledge, it is the first serving system to support multi-million-token contexts while simultaneously preserving exact results, high throughput, and low latency.

📝 Abstract
As large language models (LLMs) evolve to handle increasingly longer contexts, serving inference requests for context lengths in the range of millions of tokens presents unique challenges. While existing techniques are effective for training, they fail to address the distinct demands of inference, such as the differing prefill and decode phases and their associated latency constraints, like Time to First Token (TTFT) and Time Between Tokens (TBT). Furthermore, no long-context inference solution today allows batching requests to increase hardware utilization. In this paper, we propose three key innovations for efficient interactive long-context LLM inference, without resorting to any approximation: adaptive chunking to reduce prefill overheads in mixed batching, Sequence Pipeline Parallelism (SPP) to lower TTFT, and KV Cache Parallelism (KVP) to minimize TBT. These contributions combine into a 3D parallelism strategy that enables Mnemosyne to scale interactive inference to context lengths of at least 10 million tokens with high batched throughput. To our knowledge, Mnemosyne is the first system to efficiently support 10-million-token context inference while satisfying a production-grade TBT SLO (30 ms) on contexts up to and including 10 million tokens.
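The adaptive-chunking idea in the abstract — sizing each prefill chunk so that co-scheduled decode tokens still meet their TBT budget — can be sketched as follows. The linear cost model, function name, and default limit here are illustrative assumptions, not Mnemosyne's actual scheduler:

```python
# Sketch of slack-aware adaptive chunking for mixed (prefill + decode)
# batching. The cost model and constants are assumptions for illustration.

TBT_SLO_MS = 30.0  # per-iteration latency budget (the paper's TBT SLO)

def pick_chunk_size(decode_cost_ms: float,
                    prefill_cost_per_token_ms: float,
                    max_chunk: int = 8192) -> int:
    """Return the largest prefill chunk (in tokens) whose estimated cost
    fits in the slack left after the batch's decode tokens are served."""
    slack_ms = TBT_SLO_MS - decode_cost_ms
    if slack_ms <= 0:
        return 0  # no slack: run a decode-only iteration
    chunk = int(slack_ms / prefill_cost_per_token_ms)
    return min(chunk, max_chunk)
```

Under this policy, a lightly loaded batch (say 10 ms of decode work at 0.01 ms per prefill token) admits a 2,000-token chunk, while a batch already at its TBT budget admits none.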
Problem

Research questions and friction points this paper is trying to address.

Efficiently serving multi-million token context LLM inference requests
Addressing varying prefill and decode phase latency constraints
Eliminating head-of-line blocking in long-context inference systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive chunking with slack-aware scheduling
Sequence Pipeline Parallelism (SPP)
KV Cache Parallelism (KVP)
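KV Cache Parallelism works by sharding a request's KV cache across workers, each computing partial attention over its shard; the partial outputs can then be recombined exactly using their log-sum-exp weights, which is why no approximation is needed. A minimal NumPy sketch of that merge step for a single query vector (function names are illustrative; this shows the standard exact-merge math, not Mnemosyne's kernels):

```python
import numpy as np

def shard_attention(q, K, V):
    """Partial attention over one KV shard.
    Returns the shard-local output and its log-sum-exp weight."""
    s = K @ q / np.sqrt(q.shape[0])   # scores for this shard's keys
    m = s.max()
    w = np.exp(s - m)                 # numerically stable exponentials
    out = (w @ V) / w.sum()           # shard-local softmax-weighted values
    lse = m + np.log(w.sum())         # log of the shard's softmax mass
    return out, lse

def merge_shards(parts):
    """Exactly combine (output, lse) pairs from all KV shards."""
    outs, lses = zip(*parts)
    lses = np.array(lses)
    m = lses.max()
    w = np.exp(lses - m)
    w /= w.sum()                      # each shard's share of total mass
    return sum(wi * oi for wi, oi in zip(w, outs))
```

Because the merge weights are the shards' softmax masses, the combined result is numerically identical to attention over the unsharded cache, consistent with the paper's no-approximation claim.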