🤖 AI Summary
To address high Token-to-Token Latency (TTL) and low throughput in real-time autoregressive decoding of LLMs over ultra-long contexts (million-token scale), caused primarily by KV-cache read bottlenecks and frequent FFN weight accesses, this paper proposes a hybrid parallel architecture that decouples attention and feed-forward computation. The method introduces: (1) a dynamic switching mechanism between KV parallelism for attention and tensor/expert parallelism for FFNs, eliminating redundant KV-head replication; and (2) synergistic optimizations including KV-cache sharding and Helix HOP-B, a lightweight inter-GPU communication scheme that overlaps computation and communication within a batch. Experiments on the Blackwell platform demonstrate that the approach supports up to 32× larger batches under the same TTL budget and reduces TTL by up to 1.5× at fixed batch sizes. It significantly expands the throughput-latency Pareto frontier and, for the first time, enables low-latency, high-throughput real-time inference at million-token context lengths.
📝 Abstract
As LLMs scale to multi-million-token KV histories, real-time autoregressive decoding under tight Token-to-Token Latency (TTL) constraints faces growing pressure. Two core bottlenecks dominate: reading Feed-Forward Network (FFN) weights and reading long KV caches. While Tensor Parallelism (TP) helps mitigate the cost of FFN weight reads, it does not scale well for attention: once TP width exceeds the number of KV heads, GPUs must duplicate KV caches, which wastes memory, limits parallelism, and constrains batch size. Meanwhile, DRAM reads for long KV histories scale linearly with batch size, further capping efficiency.
We introduce Helix Parallelism, a hybrid execution strategy that applies KV parallelism during attention to shard KV caches across GPUs, then reuses the same GPUs for TP in dense LLMs, or TP combined with Expert Parallelism (EP) in MoEs, during FFN computation. To preserve exact attention behavior, Helix adds a lightweight communication step; to minimize its exposed cost, we introduce Helix HOP-B, which hides communication behind computation through batchwise overlap, preserving low TTL while improving GPU efficiency. Compared to conventional parallelism approaches, Helix reduces TTL by up to 1.5x at fixed batch sizes and supports up to 32x larger batches under the same latency budget for DeepSeek-R1, pushing out the throughput-latency Pareto frontier on Blackwell and making real-time inference with ultra-long sequences practical.
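A minimal single-process sketch of the two phases described above, with NumPy arrays standing in for per-GPU shards. The log-sum-exp recombination used to keep attention exact, and the column/row FFN sharding, are standard techniques assumed here for illustration; they are not the paper's precise communication scheme, and all shapes and names are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
d, S, G = 8, 64, 4                   # head dim, KV-cache length, simulated GPU count

q = rng.standard_normal(d)           # one decode-step query
K = rng.standard_normal((S, d))      # full KV history (keys)
V = rng.standard_normal((S, d))      # full KV history (values)

# --- Attention phase: KV parallelism (shard the KV cache along the sequence) ---
partials = []
for Ks, Vs in zip(np.split(K, G), np.split(V, G)):
    s = Ks @ q / np.sqrt(d)                  # local attention scores
    m = s.max()                              # local max for numerical stability
    e = np.exp(s - m)
    partials.append((m, e.sum(), e @ Vs))    # (local max, denominator, unnormalized output)

# Lightweight "communication" step: exact recombination of per-shard partials
m_g = max(m for m, _, _ in partials)
den = sum(z * np.exp(m - m_g) for m, z, _ in partials)
out = sum(o * np.exp(m - m_g) for m, _, o in partials) / den

# Matches unsharded attention exactly
s_ref = K @ q / np.sqrt(d)
ref = np.exp(s_ref - s_ref.max()) / np.exp(s_ref - s_ref.max()).sum() @ V
assert np.allclose(out, ref)

# --- FFN phase: the same "GPUs" switch to tensor parallelism ---
W1 = rng.standard_normal((d, 4 * d))         # up-projection, column-sharded
W2 = rng.standard_normal((4 * d, d))         # down-projection, row-sharded
ffn_ref = np.maximum(out @ W1, 0) @ W2

# Per-GPU partial FFN outputs are summed (the all-reduce in real TP)
ffn = sum(np.maximum(out @ W1c, 0) @ W2r
          for W1c, W2r in zip(np.split(W1, G, axis=1), np.split(W2, G, axis=0)))
assert np.allclose(ffn, ffn_ref)
```

Because the KV cache is sharded rather than replicated, each simulated GPU touches only `S / G` history tokens during attention, while the FFN phase still uses all `G` devices for weight reads; that reuse of the same GPU pool across the two phases is the core idea the sketch illustrates.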