🤖 AI Summary
This work addresses inefficiencies in pipeline-parallel training: synchronous schedules lose time to pipeline bubbles, while asynchronous schedules introduce a mismatch between the weight versions used in the forward and backward passes. The authors propose PACI (Pipeline Asynchronous training with Controlled Inconsistency), which uses local gradient accumulation to slow the rate at which parameter versions evolve relative to the pipeline delay. This bounds the number of optimizer updates any micro-batch crosses between its forward and backward passes, and it enables bubble-free asynchronous execution without weight stashing, prediction, extra parameter copies, or global synchronization. In GPT-style language-model pretraining, PACI matches the stability, final perplexity, and peak memory footprint of synchronous 1F1B-flush, attains full pipeline utilization, and improves training time-to-accuracy by up to 1.69× over the fastest flush baseline.
📝 Abstract
Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency. Synchronous pipelines preserve forward/backward weight consistency but suffer from bubbles; asynchronous pipelines remove bubbles but introduce weight-version mismatch, typically requiring weight stashing, prediction, or correction mechanisms. We introduce PACI (Pipeline Asynchronous training with Controlled Inconsistency), a bubble-free asynchronous pipeline method that bounds forward/backward version drift without weight stashing, prediction, additional parameter copies, or global synchronization. The key idea is to use local gradient accumulation as a version-control mechanism: by slowing parameter-version evolution relative to pipeline delay, PACI limits the number of optimizer updates crossed by any micro-batch while preserving steady-state utilization. In GPT-style language-model pretraining, PACI matches the stability and final perplexity of synchronous 1F1B-flush, retains the same peak memory footprint, achieves fully utilized pipeline throughput, and improves training time-to-accuracy by up to $1.69\times$ over the fastest flush baseline. These results show that forward/backward inconsistency need not be eliminated: when explicitly bounded, it can be safely traded for substantial efficiency gains.
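To make the version-control idea concrete, the following is a minimal sketch of a toy timing model, not the authors' schedule or implementation. It assumes a single pipeline stage that runs one forward and one backward pass per time slot, a fixed forward-to-backward delay measured in slots, and one optimizer update per `accum_steps` backward passes. The function names and parameters (`version_drift`, `max_version_drift`, `delay`, `accum_steps`) are hypothetical and chosen only for illustration.

```python
def version_drift(microbatch, delay, accum_steps):
    """Count optimizer updates applied between one micro-batch's forward and backward.

    Toy model (an assumption, not the paper's schedule):
    - micro-batch m runs forward in slot m and backward in slot m + delay;
    - the stage applies one optimizer update per `accum_steps` backward passes.
    """
    # Backward passes already finished when micro-batch m starts its forward pass.
    done_before_forward = max(0, microbatch - delay)
    # Backward passes already finished when micro-batch m starts its backward pass.
    done_before_backward = microbatch
    # Weight version = number of optimizer updates applied so far.
    version_at_forward = done_before_forward // accum_steps
    version_at_backward = done_before_backward // accum_steps
    return version_at_backward - version_at_forward


def max_version_drift(delay, accum_steps, num_microbatches=256):
    """Worst-case drift over a run of micro-batches in the toy model."""
    return max(
        version_drift(m, delay, accum_steps) for m in range(num_microbatches)
    )


if __name__ == "__main__":
    # Larger accumulation windows slow version evolution relative to the delay.
    for k in (1, 2, 4, 8):
        print(f"accum_steps={k}: max drift = {max_version_drift(7, k)}")
```

In this toy model the worst-case drift works out to `ceil(delay / accum_steps)`, so with a delay of 7 slots it falls from 7 updates without accumulation to 1 update once the accumulation window covers the delay. How PACI actually chooses the accumulation window per stage is not specified in the abstract.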