FlashCP: Load-Balanced Communication-Efficient Context Parallelism for LLM Training

📅 2026-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing context parallelism (CP) approaches suffer from workload imbalance, inefficient kernels, and redundant communication. This work proposes FlashCP, a load-balanced and communication-efficient CP training framework with three key contributions: a sharding-aware communication mechanism that eliminates redundant transmission of key-value (KV) tensors, a Whole-Doc sharding strategy that maximizes communication savings while keeping workloads balanced, and a heuristic algorithm that searches for near-optimal plans combining Whole-Doc and Per-Doc sharding. Experiments show up to a 1.63× speedup over state-of-the-art CP frameworks across diverse datasets.

📝 Abstract

Context parallelism (CP) is essential for training large-scale, long-context language models, as it partitions sequences to reduce memory overhead. However, existing CP methods suffer from workload imbalance, inefficient kernels, and redundant communication due to static sequence sharding and key-value (KV) tensor communication. We present FlashCP, a load-balanced and communication-efficient framework for CP training. FlashCP introduces a sharding-aware communication mechanism to eliminate redundant KV communication and proposes a novel Whole-Doc sharding strategy that maximizes communication savings while maintaining balanced workloads. To efficiently combine Whole-Doc and Per-Doc sharding, FlashCP further designs a heuristic algorithm to search for near-optimal sharding plans. Extensive experiments show that FlashCP achieves up to 1.63x speedup over state-of-the-art CP frameworks across diverse datasets.
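The load-balancing problem behind whole-document sharding can be illustrated with a small sketch. This is not FlashCP's planning algorithm, which the abstract does not specify; it is a generic greedy longest-first heuristic, under the assumption that a document's attention cost grows with the number of causal query-key pairs. The names `causal_attention_cost` and `assign_whole_docs` are hypothetical and chosen only for illustration.

```python
import heapq


def causal_attention_cost(length):
    """Count (query, key) pairs under a causal mask for one document.

    Assumed cost model for illustration: token i attends to tokens 0..i.
    """
    return length * (length + 1) // 2


def assign_whole_docs(doc_lengths, num_ranks):
    """Greedily place whole documents on CP ranks, longest first.

    Hypothetical helper, not the paper's method. Each document goes to the
    rank with the smallest accumulated attention cost so far.
    Returns (plan, loads): plan[r] lists the document indices on rank r,
    and loads[r] is that rank's total assumed cost.
    """
    # Min-heap of (accumulated cost, rank id).
    heap = [(0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    plan = [[] for _ in range(num_ranks)]

    # Visit documents from longest to shortest (ties keep input order).
    order = sorted(range(len(doc_lengths)),
                   key=lambda i: doc_lengths[i], reverse=True)
    for doc in order:
        load, rank = heapq.heappop(heap)
        plan[rank].append(doc)
        new_load = load + causal_attention_cost(doc_lengths[doc])
        heapq.heappush(heap, (new_load, rank))

    loads = [0] * num_ranks
    for load, rank in heap:
        loads[rank] = load
    return plan, loads


if __name__ == "__main__":
    plan, loads = assign_whole_docs([8, 4, 4, 2, 2], num_ranks=2)
    print(plan)   # which documents each rank holds
    print(loads)  # assumed attention cost per rank
```

In this toy input, the single longest document accounts for more than half of the total cost, so no placement of whole documents can balance the two ranks perfectly. Cases like this are presumably where splitting a document across ranks (Per-Doc sharding) becomes necessary, which is consistent with the abstract's motivation for combining the two strategies.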

Problem

Research questions and friction points this paper is trying to address.

context parallelism

workload imbalance

redundant communication

KV tensor

sequence sharding

Innovation

Methods, ideas, or system contributions that make the work stand out.

context parallelism

load balancing

communication efficiency